Skip to content

Conversation

@HaoTien
Copy link

@HaoTien HaoTien commented Nov 21, 2025

feat: Add error ratio-based circuit breaking policy to api-breaker plugin

What this PR does / why we need it

This PR implements error ratio-based circuit breaking (unhealthy-ratio policy) for the api-breaker plugin, providing more intelligent and adaptive circuit breaking behavior based on error rates within a sliding time window, rather than just consecutive failure counts.

Closes #12763

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Documentation update

Description

Current Limitations

  • The existing failure count-based approach only considers consecutive failures
  • It doesn't account for the overall error rate in relation to total requests
  • May be too sensitive during low traffic periods or not sensitive enough during high traffic periods

New Features Added

  • Error ratio-based circuit breaking: New unhealthy-ratio policy that triggers circuit breaker based on error rate within a sliding time window
  • Configurable parameters: Support for error ratio threshold, minimum request threshold, sliding window size, etc.
  • Circuit breaker states: Proper implementation of CLOSED, OPEN, and HALF_OPEN states
  • Backward compatibility: Existing configurations continue to work without changes

New Configuration Parameters

Parameter Type Default Description
policy string "unhealthy-count" Circuit breaker policy
unhealthy.error_ratio number 0.5 Error rate threshold (0-1) to trigger circuit breaker
unhealthy.min_request_threshold integer 10 Minimum requests needed before evaluating error rate
unhealthy.sliding_window_size integer 300 Sliding window size in seconds for error rate calculation
unhealthy.permitted_number_of_calls_in_half_open_state integer 3 Number of permitted calls in half-open state
healthy.success_ratio number 0.6 Success rate threshold to close circuit breaker from half-open state

Example Configuration

{
  "plugins": {
    "api-breaker": {
      "break_response_code": 503,
      "policy": "unhealthy-ratio",
      "max_breaker_sec": 60,
      "unhealthy": {
        "http_statuses": [500, 502, 503, 504],
        "error_ratio": 0.5,
        "min_request_threshold": 10,
        "sliding_window_size": 300,
        "permitted_number_of_calls_in_half_open_state": 3
      },
      "healthy": {
        "http_statuses": [200, 201, 202],
        "success_ratio": 0.6
      }
    }
  }
}

How Has This Been Tested?

  • Schema validation tests for new parameters
  • Functional tests for error ratio calculation
  • Circuit breaker state transition tests
  • Integration tests with various traffic patterns
  • Backward compatibility tests
  • Performance tests to ensure no regression

Test Results

# Run the new test file
prove -I. -r t/plugin/api-breaker2.t

# Verify existing tests still pass
prove -I. -r t/plugin/api-breaker.t

Files Modified

  • apisix/plugins/api-breaker.lua - Core plugin logic with new ratio-based policy
  • t/plugin/api-breaker2.t - New comprehensive test file for ratio-based circuit breaking
  • docs/en/latest/plugins/api-breaker.md - Updated English documentation
  • docs/zh/latest/plugins/api-breaker.md - Updated Chinese documentation

Checklist

  • My code follows the code style of this project
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have read the CONTRIBUTING document
  • I have added tests to cover my changes
  • All new and existing tests passed
  • I have squashed my commits into logical units
  • My commit messages are in the proper format

Additional Notes

This implementation:

  • Maintains full backward compatibility - existing configurations work unchanged
  • Follows APISIX patterns - consistent with existing plugin architecture
  • Comprehensive testing - covers all scenarios and edge cases
  • Performance optimized - efficient sliding window implementation
  • Well documented - updated both English and Chinese docs

The feature addresses real-world use cases for:

  • High-traffic services with better error spike handling
  • Variable traffic patterns with adaptive behavior
  • Microservices architectures requiring precise circuit breaking
  • SLA-based circuit breaking with configurable error rates

Ready for review and feedback!

…ugin

- Add new 'unhealthy-ratio' policy that triggers circuit breaker based on error rate within sliding time window
- Implement three-state circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED
- Add configurable parameters: error_ratio, min_request_threshold, sliding_window_size, permitted_number_of_calls_in_half_open_state, success_ratio
- Maintain full backward compatibility with existing 'unhealthy-count' policy as default
- Add comprehensive test coverage for new functionality
- Update documentation in both Chinese and English
- Follow APISIX coding standards and testing conventions

This enhancement provides more intelligent circuit breaking for microservices architectures by considering error rates rather than just consecutive failure counts.
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. doc Documentation things enhancement New feature or request labels Nov 21, 2025
Copy link
Contributor

@Baoyuantop Baoyuantop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! Based on the current configuration, we need to add some test cases:

  1. After the sliding window time (sliding_window_size) expires, are the statistics (total number of requests, number of failures) correctly cleared?

  2. Failure fallback in half-open state (Half-Open -> Open)

  3. Sending more requests than permitted_number_of_calls_in_half_open_state in half-open state

description = "Size of the sliding window in seconds"
},
default = {http_statuses = {500}, failures = 3}
permitted_number_of_calls_in_half_open_state = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to optimize this variable name?

Comment on lines +446 to +459
=== TEST $((${1}+1)): hit route (return 200)
--- request
GET /api_breaker
--- response_body
hello world



=== TEST $((${1}+1)): hit route and return 500 (first failure)
--- request
GET /api_breaker?code=500
--- error_code: 500
--- response_body
fault injection!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make multiple requests in a single case; you can refer to the tests in api-breaker.t

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask if there is an official test image of apisix? It is very difficult to set up the environment for testing .t files locally

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc Documentation things enhancement New feature or request size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

2 participants