
Implement adaptive libp2p resource management with rate limiting #5322

@gacevicljubisa


Summary

We should enhance the existing libp2p resource management and rate limiting to support adaptive scaling based on hardware capabilities. Currently, resource limits are hard-coded and static (5000 incoming streams, 10000 outgoing streams, 200 connections per IP). This enhancement would make limits dynamic and adaptive—scaling from resource-constrained devices to high-performance servers—while adding a soft connection trimmer and optional private network support.

Motivation

Bee nodes currently have basic rate limiting and per-IP connection limits, but they don't adapt to hardware capabilities:

  1. Static Limits Don't Scale: Hard-coded stream limits (5000/10000) and per-IP limits (200 connections) are the same for a 1GB Raspberry Pi and a 32GB server. The Raspberry Pi can be overloaded by limits sized far beyond its capacity, while the server is artificially constrained below what it could handle.
  2. No Soft Limit Management: Nodes hit hard resource limits abruptly with no graceful degradation. There's no connection trimmer to reduce load before hitting system limits, leading to sudden disconnections.
  3. Limited Rate Limiting: While per-IP rate limiting exists (10 conn/sec, burst 40), it's not integrated with the overall resource strategy or configurable per deployment.
  4. Private Network Limitations: Private IP ranges (e.g., 10.0.0.0/8, 192.168.0.0/16) cannot be exempted from rate limits, breaking local cluster deployments that developers need for testing.
  5. No Bootnode Prioritization: Bootnodes aren't automatically exempted from rate limits, potentially causing bootstrap failures under high load.
  6. No Protocol Prioritization: All protocols consume resources equally; critical protocols like Hive could be starved by background traffic.

Implementation

  1. Hardware-Adaptive Scaling:

    • Move from hard-coded limits to auto-scaled limits based on available system memory
    • Define reasonable base limits for constrained devices
    • Scale limits proportionally upward as more memory becomes available
    • Preserve the existing rate limiting approach but integrate it with dynamic system limits
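
A minimal sketch of how this could look with go-libp2p's scaling limit config, which derives concrete limits from total system memory. The base values and increments below are placeholders, not proposed numbers:

```go
package main

import (
	"fmt"

	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// Start from go-libp2p's default scaling config and override the base
	// limits that apply on a minimal machine; placeholder values only.
	scaling := rcmgr.DefaultLimits
	scaling.SystemBaseLimit.StreamsInbound = 512
	scaling.SystemBaseLimit.StreamsOutbound = 1024

	// Each additional GiB of memory raises the limits by these increments,
	// so a 32GB server ends up with far higher limits than a 1GB device.
	scaling.SystemLimitIncrease.StreamsInbound = 512
	scaling.SystemLimitIncrease.StreamsOutbound = 1024

	// AutoScale measures total system memory and produces concrete limits.
	limits := scaling.AutoScale()

	rm, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(limits))
	if err != nil {
		panic(err)
	}
	defer rm.Close()
	fmt.Println("adaptive resource manager ready")
}
```

The host would then be constructed with the `libp2p.ResourceManager(rm)` option, so the existing rate limiting can sit alongside the dynamically scaled system limits.
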
  2. Improved Per-IP Connection Limits:

    • Replace fixed 200-per-IP limit with dynamic calculation based on total system connections
    • Scale fairly: smaller nodes get smaller per-IP allowances, larger servers allow more
    • Maintain the existing IPv4 /32 and IPv6 /56 subnet-based approach
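
A sketch of the dynamic per-IP calculation; the function name, divisor, and clamping bounds are illustrative assumptions, not proposed constants:

```go
package limits

// perIPConnLimit derives a per-IP (or per-subnet) connection allowance from
// the node's total connection limit, replacing the fixed 200. Small nodes
// still accept a handful of connections per address, while large servers do
// not hand out an unbounded per-IP budget.
func perIPConnLimit(systemConns int) int {
	limit := systemConns / 32
	switch {
	case limit < 8:
		return 8
	case limit > 512:
		return 512
	default:
		return limit
	}
}
```
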
  3. Connection Manager (Soft Limits):

    • Add a connection manager that trims excess connections before hitting hard limits
    • Implement configurable grace periods to protect new connections
    • Use hysteresis to prevent rapid connection cycling under load
    • This prevents the abrupt failures that currently occur when limits are exceeded
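
A sketch using go-libp2p's basic connection manager, whose low/high watermarks provide the hysteresis and whose grace period protects new connections; the watermark values and duration here are placeholders:

```go
package main

import (
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
)

func main() {
	// Low/high watermarks give the hysteresis: once the node exceeds the
	// high watermark, the manager trims back down to the low one, so the
	// connection count does not oscillate around a single hard limit.
	cm, err := connmgr.NewConnManager(
		400, // low watermark: trim down to this
		600, // high watermark: start trimming above this
		// New connections are immune to trimming for the grace period.
		connmgr.WithGracePeriod(time.Minute),
	)
	if err != nil {
		panic(err)
	}

	h, err := libp2p.New(libp2p.ConnectionManager(cm))
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```
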
  4. Bootnode Allowlisting:

    • Automatically exempt bootnode multiaddrs from rate limits
    • Ensure reliable bootstrap connectivity regardless of load conditions
    • Gracefully handle invalid bootnode addresses
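
A sketch of the allowlisting via the resource manager's allowlist option; the function name and config wiring are hypothetical, and since the rcmgr allowlist matches on IP, this assumes IP-based (ip4/ip6) bootnode multiaddrs:

```go
package limits

import (
	"log"

	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
	"github.com/multiformats/go-multiaddr"
)

// bootnodeAllowlist turns the configured bootnode addresses into a
// resource-manager allowlist option. Invalid entries are logged and
// skipped rather than failing startup.
func bootnodeAllowlist(bootnodes []string) rcmgr.Option {
	var allowed []multiaddr.Multiaddr
	for _, s := range bootnodes {
		ma, err := multiaddr.NewMultiaddr(s)
		if err != nil {
			log.Printf("skipping invalid bootnode address %q: %v", s, err)
			continue
		}
		allowed = append(allowed, ma)
	}
	return rcmgr.WithAllowlistedMultiaddrs(allowed)
}
```

The returned option would be passed alongside the limiter, e.g. `rcmgr.NewResourceManager(limiter, bootnodeAllowlist(bootnodes))`.
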
  5. Private CIDR Support (new optional flag):

    • Add --allow-private-cidrs flag to exempt private IP ranges from rate limiting
    • Enable local cluster deployments and development setups
    • Disabled by default for security on public nodes
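
A sketch of how the flag could gate the exemption, using the standard library's private-range check; the function name and wiring are hypothetical:

```go
package limits

import "net/netip"

// exemptFromRateLimit reports whether a remote address may bypass the
// per-IP rate limiter. Private ranges (10.0.0.0/8, 172.16.0.0/12,
// 192.168.0.0/16, and fc00::/7 for IPv6) are exempt only when the
// operator has opted in via --allow-private-cidrs, which defaults to
// false so that public nodes keep full rate limiting.
func exemptFromRateLimit(addr netip.Addr, allowPrivateCIDRs bool) bool {
	return allowPrivateCIDRs && addr.IsPrivate()
}
```
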
  6. Encapsulated Resource Manager:

    • Extract resource manager configuration into a separate module for maintainability
    • Centralize all resource limit logic in one place
    • Prepare the codebase for future enhancements (e.g., per-protocol limits left in place as a commented example; see the sketch below)
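
One shape the dedicated module could take, tying the pieces above together behind a single constructor, with a per-protocol limit (e.g., for Hive) left as a commented example. The package name, function, and protocol ID are hypothetical:

```go
// Package resourcemanager (hypothetical name) centralizes all of the
// resource limit logic behind a single constructor.
package resourcemanager

import (
	"github.com/libp2p/go-libp2p/core/network"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

// New builds the node's resource manager from the auto-scaled limits plus
// any allowlist options (e.g., bootnodes, private CIDRs).
func New(opts ...rcmgr.Option) (network.ResourceManager, error) {
	scaling := rcmgr.DefaultLimits

	// Example of a future per-protocol limit, kept commented out as in
	// the proposal: reserve headroom for a critical protocol like Hive
	// so it cannot be starved by background traffic.
	// scaling.AddProtocolLimit(
	// 	protocol.ID("/swarm/hive/1.1.0"), // illustrative protocol ID
	// 	rcmgr.BaseLimit{StreamsInbound: 64, StreamsOutbound: 64, Memory: 4 << 20},
	// 	rcmgr.BaseLimitIncrease{StreamsInbound: 64, StreamsOutbound: 64, Memory: 4 << 20},
	// )

	limiter := rcmgr.NewFixedLimiter(scaling.AutoScale())
	return rcmgr.NewResourceManager(limiter, opts...)
}
```
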

Key Differences from Current Implementation:

  • Replaces static stream limits (5000/10000) with adaptive scaling
  • Replaces static per-IP limit (200) with dynamic calculation
  • Adds connection manager for graceful load management
  • Adds bootnode allowlisting for reliable bootstrap
  • Adds optional private network support
  • Consolidates resource manager logic into dedicated module

Drawbacks

  1. Configuration Complexity: Operators need to understand how system resources map to connection limits and to be aware of the auto-scaling behavior.

  2. Testing Requirements: Validation across diverse hardware profiles and network conditions is important to ensure limits work correctly in varied deployments.

  3. Tuning Uncertainty: Initial scaling factors are educated estimates. Real-world deployments may reveal suboptimal values requiring adjustments.

  4. Private CIDR Security: The --allow-private-cidrs flag, if accidentally enabled on public nodes, bypasses rate limits for entire private ranges. Clear documentation and warnings are needed.

  5. Soft Limit Interactions: Connection trimming behavior adds complexity and requires careful testing to ensure it doesn't cause unintended disconnections.

  6. Upgrade Impact: Nodes will enforce new adaptive limits during upgrade, potentially causing temporary connection fluctuations. Network stability during the upgrade window requires monitoring.

  7. Memory Calculation Variability: Auto-scaling based on system memory may be inaccurate for containerized deployments (where the detected memory can be the host total rather than the container's cgroup limit) or NUMA systems, potentially requiring manual calibration.
