Add comprehensive monitoring infrastructure with cache metrics and resource limits #23
Features and Improvements
🎯 Cache Metrics Monitoring
Integration of Redis Exporter (port 9121) to collect Redis metrics
Integration of Memcached Exporter (port 9150) to collect Memcached metrics
Addition of Telegraf for scraping Prometheus metrics from exporters
Telegraf configuration (
Cache metrics are now sent to InfluxDB every 10 seconds
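For reviewers unfamiliar with how hit and miss rates are usually derived, here is a minimal sketch. It assumes the exporters expose cumulative hit and miss counters, and that a rate over a 10-second window comes from the difference between two consecutive samples. The function names are hypothetical and are not taken from this PR.

```python
def hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from the cache (0.0 when there were no lookups)."""
    total = hits + misses
    if total == 0:
        return 0.0
    return hits / total


def interval_hit_rate(prev: tuple[int, int], curr: tuple[int, int]) -> float:
    """Hit rate between two samples of cumulative (hits, misses) counters.

    Assumes the counters only grow between samples. A counter reset
    (for example after a cache restart) is treated as an empty window.
    """
    delta_hits = curr[0] - prev[0]
    delta_misses = curr[1] - prev[1]
    if delta_hits < 0 or delta_misses < 0:
        return 0.0
    return hit_rate(delta_hits, delta_misses)


if __name__ == "__main__":
    # Two samples taken 10 seconds apart: 90 new hits, 10 new misses.
    print(interval_hit_rate((1000, 200), (1090, 210)))
```

The miss rate over the same window is one minus the hit rate, provided the window contains at least one lookup.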
📊 Enhanced Grafana Dashboard
New "Cache Hit/Miss Rate Over Time" panel showing hit and miss rates in real time
4 new stat panels:
Redis Cache Hit Rate
Redis Cache Miss Rate
Memcached Cache Hit Rate
Memcached Cache Miss Rate
Improved visual thresholds for response times (P95 and P99):
🟢 Green (0-50ms): Excellent
🟡 Yellow (50-100ms): Good
🟠 Orange (100-150ms): Attention needed
🔴 Red (>150ms): Problematic
Fixed min/max limits for success rate gauge (0-1)
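The response-time bands above can be expressed as a small lookup, which may help when checking that the dashboard colors match expectations. This is a sketch only: `latency_band` is a hypothetical helper, and the handling of the exact boundary values (50, 100, 150) is an assumption, chosen so that red applies strictly above 150ms as listed.

```python
def latency_band(ms: float) -> str:
    """Map a P95/P99 response time in milliseconds to its threshold color.

    Assumption: a value sitting exactly on a boundary belongs to the
    lower band, so 150 ms is still orange and red starts above 150 ms.
    """
    if ms < 0:
        raise ValueError("response time cannot be negative")
    if ms <= 50:
        return "green"   # Excellent
    if ms <= 100:
        return "yellow"  # Good
    if ms <= 150:
        return "orange"  # Attention needed
    return "red"         # Problematic


if __name__ == "__main__":
    for value in (12, 75, 140, 320):
        print(value, latency_band(value))
```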
⚙️ Resource Limits
Defined CPU and memory limits for all containers:
App: 1 CPU, 1GB RAM
Redis: 0.25 CPU, 256MB RAM
Memcached: 0.25 CPU, 128MB RAM
InfluxDB: 0.5 CPU, 512MB RAM
Grafana: 0.5 CPU, 512MB RAM
K6: 0.5 CPU, 256MB RAM
Redis Exporter: 0.1 CPU, 64MB RAM
Memcached Exporter: 0.1 CPU, 64MB RAM
Telegraf: 0.25 CPU, 256MB RAM
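To estimate what the host needs in order to run the full stack at its limits, the per-container figures above can be summed. The sketch below copies the values from the list. Treating 1GB as 1024MB is an assumption, and the dictionary keys are illustrative labels rather than the actual service names in the compose file.

```python
# (cpus, memory_mb) per container, copied from the limits listed above.
LIMITS = {
    "app": (1.0, 1024),  # 1GB taken as 1024MB (assumption)
    "redis": (0.25, 256),
    "memcached": (0.25, 128),
    "influxdb": (0.5, 512),
    "grafana": (0.5, 512),
    "k6": (0.5, 256),
    "redis-exporter": (0.1, 64),
    "memcached-exporter": (0.1, 64),
    "telegraf": (0.25, 256),
}


def total_budget(limits: dict[str, tuple[float, int]]) -> tuple[float, int]:
    """Return the summed CPU and memory limits across all containers."""
    # Round to avoid floating-point noise from adding values like 0.1.
    cpus = round(sum(cpu for cpu, _ in limits.values()), 2)
    memory_mb = sum(mem for _, mem in limits.values())
    return cpus, memory_mb


if __name__ == "__main__":
    cpus, memory_mb = total_budget(LIMITS)
    print(f"{cpus} CPUs, {memory_mb} MB")
```

With these figures the stack is capped at 3.45 CPUs and 3072MB of memory in total.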
📚 Expanded Documentation
Complete rewrite of [benchmark/README.md]
Detailed explanation of monitoring architecture
Documentation about percentiles (P95, P99) and their importance
Visual diagram of metrics flow
Troubleshooting guide for common issues
Instructions for viewing cache metrics
Explanation of visual thresholds in Grafana
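As a quick illustration of why the documentation emphasizes P95 and P99, here is a minimal nearest-rank percentile sketch. It is not necessarily the method K6 or Grafana uses internally (interpolating definitions give slightly different values), and `percentile` is a hypothetical helper.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # 98 fast requests at 10 ms and 2 slow requests at 500 ms.
    latencies = [10.0] * 98 + [500.0] * 2
    print("mean:", sum(latencies) / len(latencies))
    print("p95:", percentile(latencies, 95))
    print("p99:", percentile(latencies, 99))
```

In this made-up sample the mean and P95 both look healthy, while P99 exposes the slow tail that a small share of users would actually experience.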
🔧 Infrastructure Improvements
Upgraded InfluxDB to version 1.12.2 (previously 1.8)
Disabled Grafana's enforced change of the default password in the development environment
Added appropriate health checks and dependencies between containers