@guanzhenxing

Problem Description

When the Admin side health check detects an upstream as unhealthy and publishes a configuration update with status=false, the Gateway side completely removes that upstream from both healthyUpstream and unhealthyUpstream maps. This causes the Gateway to lose track of the upstream, preventing its independent health check from recovering the upstream when it becomes healthy again.

Error Manifestation

divide upstream configuration error
CANNOT_FIND_HEALTHY_UPSTREAM_URL

Root Cause

ShenYu has two independent health check systems:

  1. Admin side: Checks upstream health and publishes configuration updates
  2. Gateway side: Independently runs health checks and maintains its own health state

The issue occurs when:

  1. Admin's health check marks an upstream as unhealthy (status=false)
  2. Admin publishes a configuration update
  3. Gateway receives the update via DivideUpstreamDataHandler
  4. UpstreamCacheManager.submit() processes status=false upstreams
  5. Bug: The original code calls triggerRemoveOne(), which removes the upstream from both the healthy and unhealthy maps
  6. Result: The Gateway loses all tracking of this upstream, so even if the upstream recovers, the Gateway never finds out
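
The failure sequence above can be reproduced with a small model. This is a minimal sketch, not ShenYu's code: the class TrackingLossSketch, its two maps keyed by selector ID, and the isTracked() helper are assumptions made for illustration. Only the name triggerRemoveOne() and its described effect (removal from both maps) come from the description.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Simplified stand-in for the gateway's health check state (hypothetical, not ShenYu's class).
class TrackingLossSketch {
    final Map<String, List<String>> healthy = new ConcurrentHashMap<>();
    final Map<String, List<String>> unhealthy = new ConcurrentHashMap<>();

    // Models the old behavior: the upstream is dropped from both maps.
    void triggerRemoveOne(String selectorId, String upstreamUrl) {
        remove(healthy, selectorId, upstreamUrl);
        remove(unhealthy, selectorId, upstreamUrl);
    }

    // True if the gateway still knows about the upstream in either map.
    boolean isTracked(String selectorId, String upstreamUrl) {
        return healthy.getOrDefault(selectorId, List.of()).contains(upstreamUrl)
                || unhealthy.getOrDefault(selectorId, List.of()).contains(upstreamUrl);
    }

    private static void remove(Map<String, List<String>> map, String selectorId, String upstreamUrl) {
        List<String> upstreams = map.get(selectorId);
        if (upstreams != null) {
            upstreams.remove(upstreamUrl);
        }
    }

    public static void main(String[] args) {
        TrackingLossSketch task = new TrackingLossSketch();
        task.unhealthy.computeIfAbsent("selector-1", k -> new CopyOnWriteArrayList<>()).add("10.0.0.1:8080");
        // Admin publishes status=false; the old code path removes the upstream entirely.
        task.triggerRemoveOne("selector-1", "10.0.0.1:8080");
        System.out.println(task.isTracked("selector-1", "10.0.0.1:8080"));
    }
}
```

Once the upstream is in neither map, a periodic check that only iterates over those maps has nothing left to probe, which is why recovery goes unnoticed.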

Solution

Design Principle

Gateway's health check state should be independent of Admin's configuration updates

Core Logic Change

Before:
status=false → triggerRemoveOne() → completely removed from both maps

After:
status=false AND healthCheckEnabled=true → preserve in unhealthy map → continue health checking
status=false AND healthCheckEnabled=false → remove (no monitoring needed)

Changes

  1. UpstreamCacheManager.java

Refactored submit() method

  • Extracted logic into smaller, focused methods for better maintainability
  • Fixed a ConcurrentModificationException by iterating over an ArrayList copy instead of the live list
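
The copy-before-iteration fix can be illustrated in isolation. The sketch below is an assumption about the general pattern, not the actual submit() code: SnapshotIterationSketch and removeOffline() are hypothetical names. Iterating over the live list while removing from it can throw ConcurrentModificationException; iterating over a snapshot avoids that.

```java
import java.util.ArrayList;
import java.util.List;

class SnapshotIterationSketch {

    // Removes every offline URL from the live list.
    // The loop walks a snapshot copy, so removing from 'live' cannot
    // invalidate the iterator that drives the loop.
    static void removeOffline(List<String> live, List<String> offline) {
        for (String url : new ArrayList<>(live)) {
            if (offline.contains(url)) {
                live.remove(url);
            }
        }
    }

    public static void main(String[] args) {
        List<String> live = new ArrayList<>(List.of("a:8080", "b:8080", "c:8080"));
        removeOffline(live, List.of("a:8080", "c:8080"));
        System.out.println(live);
    }
}
```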

New method: processOfflineUpstreams()

Handles upstreams with status=false:

  // If upstream was previously in unhealthy map AND health check is enabled:
  //   → Keep it in unhealthy map for continued monitoring
  // If upstream was not previously unhealthy OR health check is disabled:
  //   → Remove it (no monitoring needed)
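
As a rough illustration of the rule in the comment block above, the sketch below filters status=false upstreams down to the ones that stay in the unhealthy map. It follows the comment as written and is not the actual processOfflineUpstreams() implementation; the nested Upstream class, the keepForMonitoring() method, and identifying upstreams by URL are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class OfflineUpstreamSketch {

    // Hypothetical minimal upstream; ShenYu's real Upstream class carries more fields.
    static final class Upstream {
        final String url;
        final boolean healthCheckEnabled;

        Upstream(String url, boolean healthCheckEnabled) {
            this.url = url;
            this.healthCheckEnabled = healthCheckEnabled;
        }
    }

    // Returns the status=false upstreams that stay in the unhealthy map.
    // Anything not returned would be removed from tracking.
    static List<Upstream> keepForMonitoring(List<Upstream> offline, Set<String> previouslyUnhealthyUrls) {
        List<Upstream> kept = new ArrayList<>();
        for (Upstream upstream : offline) {
            boolean wasUnhealthy = previouslyUnhealthyUrls.contains(upstream.url);
            if (wasUnhealthy && upstream.healthCheckEnabled) {
                kept.add(upstream);
            }
        }
        return kept;
    }
}
```

The set of previously unhealthy URLs plays the role that getCurrentUnhealthyMap() is described as serving: a snapshot of the unhealthy state taken before the update is applied.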

New method: processValidUpstreams()

Handles upstreams with status=true:

  • Checks if upstream was previously in unhealthyUpstream map
  • If yes, preserves the unhealthy state instead of forcing it to healthy
  • This allows Gateway's health check to recover it naturally
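
A minimal sketch of that placement rule follows. It is an illustration under assumptions, not the real processValidUpstreams(): ValidUpstreamSketch, place(), and the use of plain URL strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ValidUpstreamSketch {
    final List<String> healthy = new ArrayList<>();
    final List<String> unhealthy = new ArrayList<>();

    // Places each status=true upstream. One that the gateway had already marked
    // unhealthy stays unhealthy, so only the gateway's own check can promote it.
    void place(List<String> validUrls, Set<String> previouslyUnhealthyUrls) {
        for (String url : validUrls) {
            if (previouslyUnhealthyUrls.contains(url)) {
                unhealthy.add(url);
            } else {
                healthy.add(url);
            }
        }
    }
}
```

The design choice here is that Admin's status=true is treated as "this upstream is configured", not as proof that the Gateway can reach it.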

New method: getCurrentUnhealthyMap()

Helper method to get current unhealthy upstreams for state preservation

  2. UpstreamCheckTask.java

Made putToMap() and removeFromMap() public

  • These methods were private but are needed by UpstreamCacheManager
  • Making them public lets UpstreamCacheManager preserve unhealthy state across configuration updates
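
For readers unfamiliar with these helpers, here is a sketch of what such map operations might look like and how a caller could use them to move an upstream between maps. The signatures are assumptions (the real methods in UpstreamCheckTask operate on ShenYu's Upstream type and may differ); only the method names putToMap() and removeFromMap() come from this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class UpstreamMapOpsSketch {

    // Adds the upstream under the selector, creating the list on first use.
    static void putToMap(Map<String, List<String>> map, String selectorId, String url) {
        List<String> upstreams = map.computeIfAbsent(selectorId, k -> new CopyOnWriteArrayList<>());
        if (!upstreams.contains(url)) {
            upstreams.add(url);
        }
    }

    // Removes the upstream from the selector's list if present.
    static void removeFromMap(Map<String, List<String>> map, String selectorId, String url) {
        List<String> upstreams = map.get(selectorId);
        if (upstreams != null) {
            upstreams.remove(url);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> healthy = new ConcurrentHashMap<>();
        Map<String, List<String>> unhealthy = new ConcurrentHashMap<>();
        putToMap(healthy, "selector-1", "10.0.0.1:8080");
        // Move the upstream from healthy to unhealthy without losing track of it.
        removeFromMap(healthy, "selector-1", "10.0.0.1:8080");
        putToMap(unhealthy, "selector-1", "10.0.0.1:8080");
        System.out.println(unhealthy);
    }
}
```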

Testing

Added 9 comprehensive tests to verify the fix:

UpstreamCacheManagerTest (4 new tests)

  1. testSubmitWithStatusFalsePreservesUnhealthyState: Verifies upstreams with status=false that were previously unhealthy remain in unhealthy map
  2. testSubmitWithNewOfflineUpstreamAddedToUnhealthy: Verifies new upstreams with status=false are added to unhealthy map for monitoring
  3. testSubmitPreservesUnhealthyForValidUpstream: Verifies valid upstreams (status=true) that were previously unhealthy remain in unhealthy map
  4. testSubmitWithHealthCheckDisabledAndStatusFalse: Verifies upstreams with healthCheckEnabled=false are removed, not added to unhealthy map

UpstreamCheckTaskTest (5 new tests)

  1. testPutToMap: Tests adding upstreams to healthy map
  2. testPutToMapUnhealthy: Tests adding upstreams to unhealthy map
  3. testRemoveFromMap: Tests removing upstreams from healthy map
  4. testRemoveFromMapUnhealthy: Tests removing upstreams from unhealthy map
  5. testMoveUpstreamBetweenMaps: Tests moving upstreams between healthy and unhealthy maps

Test Results

Tests run: 19, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS

Impact

Before Fix

  • Gateway loses unhealthy upstream tracking when Admin publishes updates
  • Recovered upstreams cannot be detected by Gateway
  • Results in CANNOT_FIND_HEALTHY_UPSTREAM_URL errors

After Fix

  • Gateway preserves its independent health check state
  • Unhealthy upstreams continue to be monitored even after Admin updates
  • Gateway can automatically recover upstreams when they become healthy
  • No manual intervention required

Commits

  • 8a0f9e9 - Fix: Preserve unhealthy upstream state when receiving config updates from admin
  • 78822b4 - Test: Add tests for upstream unhealthy state preservation

guanzhenxing and others added 3 commits January 19, 2026 14:52
…from admin

When admin publishes configuration updates with upstreams marked as status=false,
the gateway should preserve their unhealthy state and continue health checking
instead of completely removing them. This allows the gateway's independent health
check to recover upstreams when they become healthy.

Changes:
- UpstreamCacheManager: Refactored submit() method to preserve unhealthy state
  for both status=true and status=false upstreams
- Added processOfflineUpstreams() to handle status=false upstreams with health
  check enabled, keeping them in unhealthy map for monitoring
- Added processValidUpstreams() to check if valid upstreams were previously
  unhealthy and preserve that status
- UpstreamCheckTask: Made removeFromMap() public to support state preservation

Co-Authored-By: Claude <noreply@anthropic.com>

Add comprehensive tests to verify the fix for preserving unhealthy upstream
state when receiving config updates from admin.

UpstreamCacheManagerTest:
- testSubmitWithStatusFalsePreservesUnhealthyState: Verify that upstreams
  with status=false that were previously unhealthy remain in unhealthy map
- testSubmitWithNewOfflineUpstreamAddedToUnhealthy: Verify new upstreams
  with status=false are added to unhealthy map for monitoring
- testSubmitPreservesUnhealthyForValidUpstream: Verify valid upstreams
  that were previously unhealthy remain in unhealthy map
- testSubmitWithHealthCheckDisabledAndStatusFalse: Verify upstreams with
  healthCheckEnabled=false are removed, not added to unhealthy map

UpstreamCheckTaskTest:
- testPutToMap: Test adding upstreams to healthy map
- testPutToMapUnhealthy: Test adding upstreams to unhealthy map
- testRemoveFromMap: Test removing upstreams from healthy map
- testRemoveFromMapUnhealthy: Test removing upstreams from unhealthy map
- testMoveUpstreamBetweenMaps: Test moving upstreams between maps

Co-Authored-By: Claude <noreply@anthropic.com>
@Aias00 Aias00 requested a review from Copilot January 19, 2026 10:09

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment on lines +264 to +265
assertTrue(!healthCheckTask.getUnhealthyUpstream().containsKey(selectorId)
|| healthCheckTask.getUnhealthyUpstream().get(selectorId).isEmpty());

Copilot AI Jan 20, 2026

The assertion logic uses double negation which reduces code readability. Instead of checking !healthCheckTask.getUnhealthyUpstream().containsKey(selectorId) || healthCheckTask.getUnhealthyUpstream().get(selectorId).isEmpty(), consider using a more straightforward assertion that directly validates the absence of upstreams in the unhealthy map.
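
One way to act on this suggestion is a small helper that treats a missing key and an empty list the same way. This is a sketch only: AbsenceAssertionSketch and hasNoUpstreams() are hypothetical names, and it assumes the getters return a Map from selector ID to a List of upstreams.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

class AbsenceAssertionSketch {

    // True when the selector has no entry, or its entry is an empty list.
    static <T> boolean hasNoUpstreams(Map<String, List<T>> map, String selectorId) {
        return map.getOrDefault(selectorId, Collections.emptyList()).isEmpty();
    }
}
```

Under that assumption, the assertion would read assertTrue(hasNoUpstreams(healthCheckTask.getUnhealthyUpstream(), selectorId)), and the same helper would cover the healthy-map check.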

Comment on lines +288 to +289
assertTrue(!healthCheckTask.getHealthyUpstream().containsKey(selectorId)
|| healthCheckTask.getHealthyUpstream().get(selectorId).isEmpty());

Copilot AI Jan 20, 2026

The assertion logic uses double negation which reduces code readability. Instead of checking !healthCheckTask.getHealthyUpstream().containsKey(selectorId) || healthCheckTask.getHealthyUpstream().get(selectorId).isEmpty(), consider using a more straightforward assertion that directly validates the absence of upstreams in the healthy map.

Comment on lines +298 to +299
assertTrue(!healthCheckTask.getUnhealthyUpstream().containsKey(selectorId)
|| healthCheckTask.getUnhealthyUpstream().get(selectorId).isEmpty());

Copilot AI Jan 20, 2026

The assertion logic uses double negation which reduces code readability. Instead of checking !healthCheckTask.getUnhealthyUpstream().containsKey(selectorId) || healthCheckTask.getUnhealthyUpstream().get(selectorId).isEmpty(), consider using a more straightforward assertion that directly validates the absence of upstreams in the unhealthy map.

…er/cache/UpstreamCheckTaskTest.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

2 participants