Add metrics and observability instrumentation #416
Implement Phase 1 of issue #403 with Prometheus metrics for production monitoring. Adds optional --metrics-addr flag to expose metrics endpoint (/metrics), health checks (/health, /ready), and instruments backend operations, template processing, command execution, and file sync operations with duration histograms and success/error counters. All metrics are optional and disabled by default.
Pull request overview
This PR implements Phase 1 of observability instrumentation by adding Prometheus metrics support to confd. The changes enable optional metrics collection through a --metrics-addr flag, exposing an HTTP endpoint with Prometheus metrics, health checks, and readiness probes. The implementation instruments backend operations, template processing, command execution, and file synchronization with latency histograms and counters.
Changes:
- Added comprehensive Prometheus metrics instrumentation across backend, template, command, and file operations
- Implemented HTTP endpoints for metrics (/metrics), liveness (/health), and readiness (/ready) probes
- Integrated metrics configuration through CLI flag and TOML config file
Reviewed changes
Copilot reviewed 13 out of 235 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/metrics/metrics.go | Defines all Prometheus metrics and initialization logic |
| pkg/metrics/backend.go | Implements instrumented wrapper for backend store clients |
| pkg/metrics/health.go | Provides HTTP handlers for health and readiness probes |
| pkg/template/resource.go | Adds metrics tracking to template processing operations |
| pkg/template/command_executor.go | Instruments check and reload command execution with metrics |
| pkg/template/template_cache.go | Tracks template cache hits and misses |
| pkg/template/file_stager.go | Records file sync operations |
| cmd/confd/cli.go | Integrates metrics server startup and client wrapping |
| cmd/confd/config.go | Adds metrics_addr configuration option |
| go.mod | Adds Prometheus client library dependency |
| pkg/metrics/metrics_test.go | Tests for metrics initialization and recording |
| pkg/metrics/backend_test.go | Tests for backend instrumentation wrapper |
| pkg/metrics/health_test.go | Tests for health and readiness handlers |
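The "instrumented wrapper for backend store clients" listed for pkg/metrics/backend.go can be pictured as a decorator around the client interface. The following is only a minimal sketch of that pattern using the standard library: `StoreClient`, `Recorder`, `InstrumentedClient`, `Wrap`, and `fakeClient` are hypothetical names invented for illustration, and the plain counters stand in for the Prometheus collectors the PR actually uses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// StoreClient is a hypothetical, trimmed-down backend interface (assumption).
type StoreClient interface {
	GetValues(keys []string) (map[string]string, error)
}

// Recorder is a hypothetical stand-in for the metric collectors.
// It is not safe for concurrent use; real collectors would be.
type Recorder struct {
	Requests int
	Errors   int
	Total    time.Duration
}

// InstrumentedClient wraps a StoreClient and records one observation per call.
type InstrumentedClient struct {
	next StoreClient
	rec  *Recorder
}

// GetValues delegates to the wrapped client, then records duration and outcome.
func (c *InstrumentedClient) GetValues(keys []string) (map[string]string, error) {
	start := time.Now()
	vals, err := c.next.GetValues(keys)
	c.rec.Total += time.Since(start)
	c.rec.Requests++
	if err != nil {
		c.rec.Errors++
	}
	return vals, err
}

// Wrap returns the client unchanged when metrics are disabled (rec == nil),
// so the disabled path adds no indirection.
func Wrap(c StoreClient, rec *Recorder) StoreClient {
	if rec == nil {
		return c
	}
	return &InstrumentedClient{next: c, rec: rec}
}

// fakeClient is a test double that optionally fails every request.
type fakeClient struct{ fail bool }

func (f fakeClient) GetValues(keys []string) (map[string]string, error) {
	if f.fail {
		return nil, errors.New("backend unavailable")
	}
	return map[string]string{"/app/key": "value"}, nil
}

func main() {
	rec := &Recorder{}
	ok := Wrap(fakeClient{}, rec)
	bad := Wrap(fakeClient{fail: true}, rec)
	_, _ = ok.GetValues([]string{"/app/key"})
	_, _ = bad.GetValues([]string{"/app/key"})
	fmt.Printf("requests=%d errors=%d\n", rec.Requests, rec.Errors) // requests=2 errors=1
}
```

Returning the original client when no recorder is configured is one way to keep metrics strictly opt-in; whether the PR does exactly this is not shown in the page.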
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #416 +/- ##
==========================================
+ Coverage 61.79% 64.30% +2.51%
==========================================
Files 37 41 +4
Lines 3484 3858 +374
==========================================
+ Hits 2153 2481 +328
- Misses 1187 1220 +33
- Partials 144 157 +13
PR #416 Review: Add metrics and observability instrumentation
Summary
This PR implements Phase 1 of #403 by adding comprehensive Prometheus metrics instrumentation to confd. The implementation is well-structured, follows Go best practices, and maintains backward compatibility with zero overhead when disabled.
✅ Strengths
Code Quality & Architecture
Security & Safety
Test Coverage
Integration
🔍 Areas for Consideration
1. Cardinality Concerns (Medium Priority)
Issue: Several metrics use high-cardinality labels that could cause problems in production:
```go
// In command_executor.go:88,138
metrics.CommandDuration.WithLabelValues("check", e.dest).Observe(duration)
metrics.CommandTotal.WithLabelValues("check", e.dest).Inc()
// In file_stager.go:116
metrics.FileSyncTotal.WithLabelValues(destPath).Inc()
// In template processing
TemplateProcessDuration.WithLabelValues(dest).Observe(duration)
```
Problem: If users have hundreds or thousands of unique destination files, this creates unbounded metric series. A system managing 1000 config files yields 1000 × 2 command types × 2 metrics = 4000+ time series for commands alone.
Recommendation:
2. Metrics Server Lifecycle (Low Priority)
Location:
Issue: The metrics HTTP server is started in a goroutine without graceful shutdown handling. When confd exits, the server may not close cleanly.
Recommendation: Add graceful shutdown using the existing context to properly close the server on SIGTERM/SIGINT.
3. Histogram Bucket Selection (Low Priority)
Issue: All histograms use …
Consideration:
Recommendation: Consider custom buckets per metric type to better capture actual latency distributions.
4. Error Message Exposure (Security - Low Risk)
Location:
```go
w.Write([]byte("backend unhealthy: " + err.Error()))
```
Issue: Error messages might expose internal details (connection strings, paths, etc.).
Recommendation: Log the full error but return a generic message to external callers.
5. Missing Metric Documentation
Issue: The PR description lists available metrics, but there's no in-code documentation for operators.
Recommendation: Add a …
🎯 Performance Considerations
Positive
Potential Concerns
📊 Test Coverage Assessment
Coverage appears excellent:
🔐 Security Assessment
No major security concerns identified:
Recommendations:
✅ Recommendations Summary
Should Fix Before Merge
Could Fix in Follow-up PR
🎉 Conclusion
This is an excellent implementation of metrics instrumentation. The code quality is high, the design is non-invasive, and the test coverage is comprehensive. The main concern is the high-cardinality labels which could cause issues in large deployments, but this is easily addressable with documentation or configuration options.
Recommendation: ✅ Approve with minor suggestions
The PR is ready to merge after addressing the two "should fix" items (documentation and graceful shutdown). The other suggestions can be addressed in follow-up PRs.
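The cardinality concern raised in the review can be made concrete with a small sketch. One possible mitigation, not necessarily the one the reviewer had in mind, is to cap the number of distinct destination values used as a label and fold the overflow into a single bucket. `LabelLimiter`, `NewLabelLimiter`, `Value`, and `SeriesEstimate` are hypothetical names introduced for illustration only; the estimate function simply reproduces the review's arithmetic.

```go
package main

import (
	"fmt"
	"sync"
)

// LabelLimiter caps the number of distinct label values that reach a metric.
// Values seen after the cap is hit are reported under the shared label "other".
type LabelLimiter struct {
	mu   sync.Mutex
	max  int
	seen map[string]struct{}
}

// NewLabelLimiter returns a limiter that admits at most max distinct values.
func NewLabelLimiter(max int) *LabelLimiter {
	return &LabelLimiter{max: max, seen: make(map[string]struct{})}
}

// Value returns dest if it is already tracked or there is room to track it,
// and "other" once the cap has been reached.
func (l *LabelLimiter) Value(dest string) string {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.seen[dest]; ok {
		return dest
	}
	if len(l.seen) < l.max {
		l.seen[dest] = struct{}{}
		return dest
	}
	return "other"
}

// SeriesEstimate reproduces the review's arithmetic:
// destination files × command types × metrics.
func SeriesEstimate(files, commandTypes, metrics int) int {
	return files * commandTypes * metrics
}

func main() {
	lim := NewLabelLimiter(2)
	for _, d := range []string{"/etc/a.conf", "/etc/b.conf", "/etc/c.conf", "/etc/a.conf"} {
		fmt.Println(lim.Value(d)) // third path is reported as "other"
	}
	fmt.Println(SeriesEstimate(1000, 2, 2)) // 4000
}
```

A cap trades per-file detail for a bounded series count; dropping the destination label entirely or making it configurable are alternatives with different trade-offs.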
… gosec
Add error handling for:
- http.ResponseWriter.Write() calls in health and ready endpoints
- temp.Close() and os.Remove() calls in file_stager cleanup paths
These changes address gosec G104 (CWE-703) warnings about unhandled errors. While these errors are typically non-critical (write failures in HTTP handlers, cleanup failures in error paths), proper error handling improves observability and satisfies security scanning requirements.
…s attacks
Configure the metrics HTTP server with a 10-second ReadHeaderTimeout to mitigate potential Slowloris denial-of-service attacks. This addresses gosec G112 (CWE-400) security warning.
Summary
Implements Phase 1 of #403: adds Prometheus metrics instrumentation for production observability. Includes optional --metrics-addr flag to expose metrics endpoint, plus /health and /ready probes. Instruments backend operations, template processing, command execution, and file sync with latency histograms and success/error counters.
Testing
Metrics Available
Backend: confd_backend_request_duration_seconds, confd_backend_request_total, confd_backend_errors_total, confd_backend_healthy
Template: confd_template_process_duration_seconds, confd_template_process_total, confd_template_cache_hits_total, confd_template_cache_misses_total
Commands: confd_command_duration_seconds, confd_command_total, confd_command_exit_code_total
Files: confd_file_sync_total, confd_file_changed_total