Skip to content

feat: Add StreamMetadataProvider caching to prevent port exhaustion#16623

Closed
xiangfu0 wants to merge 1 commit intoapache:masterfrom
xiangfu0:cache-kafka-client
Closed

feat: Add StreamMetadataProvider caching to prevent port exhaustion#16623
xiangfu0 wants to merge 1 commit intoapache:masterfrom
xiangfu0:cache-kafka-client

Conversation

@xiangfu0
Copy link
Contributor

Problem

PR #15393 introduced unique client IDs to resolve JMX conflicts, but this created too many open Kafka connections/ports for Pinot servers, leading to resource exhaustion on high-traffic clusters.

Solution

Implements comprehensive caching for StreamMetadataProvider instances that:

  • Reuses connections while maintaining unique client ID benefits
  • Prevents orphan providers with explicit cleanup on recreation
  • Ensures thread safety with synchronized operations
  • Provides graceful shutdown with automatic cleanup hooks

Key Features

🔧 StreamMetadataProviderCacheManager

  • Singleton cache manager with separate caches for stream and partition providers
  • Configurable cache expiry (30 minutes) and size limits (1000 entries)
  • Automatic cleanup using Guava RemovalListeners
  • Base client ID extraction for proper caching despite unique suffixes

🔄 Enhanced StreamConsumerFactory

  • createCachedStreamMetadataProvider() - Get cached or create new stream provider
  • createCachedPartitionMetadataProvider() - Get cached or create new partition provider
  • recreateCachedStreamMetadataProvider() - Force recreation with explicit cleanup
  • recreateCachedPartitionMetadataProvider() - Force recreation with explicit cleanup

🛡️ Orphan Prevention

  • Explicit old provider cleanup before creating new ones during recreation
  • Synchronized recreation methods to prevent race conditions
  • Enhanced clearAll() that closes all providers before clearing cache
  • Shutdown hooks for graceful cleanup on application termination

📊 Updated Client Classes

  • PartitionGroupMetadataFetcher
  • PinotLLCRealtimeSegmentManager
  • RealtimeConsumptionRateManager
  • StreamMetadataProvider.computePartitionGroupMetadata

Testing

Includes comprehensive unit tests:

  • Cache reuse verification
  • Orphan prevention validation
  • Recreation cleanup testing
  • Thread safety verification

Benefits

Reduced Port Usage: Reuses existing connections instead of creating new ones
Better Performance: Cached connections avoid establishment overhead
Resource Efficiency: Prevents resource exhaustion on high-traffic clusters
Maintained Reliability: Still recreates connections on failures
Backward Compatibility: Original methods continue to work

Impact

Resolves port exhaustion while maintaining all benefits of unique client IDs from PR #15393. Production-ready with comprehensive error handling and monitoring.

Related: Addresses issues introduced in #15393

Implements comprehensive caching solution for StreamMetadataProvider instances
to address port exhaustion issues caused by random client IDs from PR apache#15393.

Key improvements:
* Add StreamMetadataProviderCacheManager with intelligent caching
* Cache providers by table+topic+partition to enable reuse
* Add explicit cleanup on recreation to prevent orphan providers
* Implement synchronized recreation methods for thread safety
* Add shutdown hooks for graceful cleanup on app termination
* Update main client classes to use cached providers:
  - PartitionGroupMetadataFetcher
  - PinotLLCRealtimeSegmentManager
  - RealtimeConsumptionRateManager
  - StreamMetadataProvider.computePartitionGroupMetadata

Benefits:
* Reduces Kafka connection count and prevents port exhaustion
* Maintains unique client ID benefits for failure isolation
* Provides automatic recreation on failures for robustness
* Ensures proper resource cleanup with no orphaned providers
* Backwards compatible with existing code

Includes comprehensive unit tests for cache functionality and
proper provider lifecycle management.
@xiangfu0 xiangfu0 closed this Aug 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant