Add metrics and observability instrumentation #416

abtreece · 2026-01-14T17:57:08Z

Summary

Implements Phase 1 of #403: adds Prometheus metrics instrumentation for production observability. Includes optional --metrics-addr flag to expose metrics endpoint, plus /health and /ready probes. Instruments backend operations, template processing, command execution, and file sync with latency histograms and success/error counters.

Testing

All 24 metrics tests pass
All existing tests pass (no breaking changes)
Build succeeds without errors
Metrics disabled by default (zero overhead)

Metrics Available

Backend: confd_backend_request_duration_seconds, confd_backend_request_total, confd_backend_errors_total, confd_backend_healthy

Template: confd_template_process_duration_seconds, confd_template_process_total, confd_template_cache_hits_total, confd_template_cache_misses_total

Commands: confd_command_duration_seconds, confd_command_total, confd_command_exit_code_total

Files: confd_file_sync_total, confd_file_changed_total

Implement Phase 1 of issue #403 with Prometheus metrics for production monitoring. Adds optional --metrics-addr flag to expose metrics endpoint (/metrics), health checks (/health, /ready), and instruments backend operations, template processing, command execution, and file sync operations with duration histograms and success/error counters. All metrics are optional and disabled by default.

Copilot

Pull request overview

This PR implements Phase 1 of observability instrumentation by adding Prometheus metrics support to confd. The changes enable optional metrics collection through a --metrics-addr flag, exposing an HTTP endpoint with Prometheus metrics, health checks, and readiness probes. The implementation instruments backend operations, template processing, command execution, and file synchronization with latency histograms and counters.

Changes:

Added comprehensive Prometheus metrics instrumentation across backend, template, command, and file operations
Implemented HTTP endpoints for metrics (/metrics), liveness (/health), and readiness (/ready) probes
Integrated metrics configuration through CLI flag and TOML config file

Reviewed changes

Copilot reviewed 13 out of 235 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
pkg/metrics/metrics.go	Defines all Prometheus metrics and initialization logic
pkg/metrics/backend.go	Implements instrumented wrapper for backend store clients
pkg/metrics/health.go	Provides HTTP handlers for health and readiness probes
pkg/template/resource.go	Adds metrics tracking to template processing operations
pkg/template/command_executor.go	Instruments check and reload command execution with metrics
pkg/template/template_cache.go	Tracks template cache hits and misses
pkg/template/file_stager.go	Records file sync operations
cmd/confd/cli.go	Integrates metrics server startup and client wrapping
cmd/confd/config.go	Adds metrics_addr configuration option
go.mod	Adds Prometheus client library dependency
pkg/metrics/metrics_test.go	Tests for metrics initialization and recording
pkg/metrics/backend_test.go	Tests for backend instrumentation wrapper
pkg/metrics/health_test.go	Tests for health and readiness handlers

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

cmd/confd/cli.go

pkg/metrics/health.go

codecov · 2026-01-14T18:00:02Z

Codecov Report

❌ Patch coverage is 72.40000% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.30%. Comparing base (4bbca52) to head (a6bd821).
⚠️ Report is 7 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/template/command_executor.go	36.00%	14 Missing and 2 partials ⚠️
cmd/confd/cli.go	0.00%	15 Missing ⚠️
pkg/template/file_stager.go	0.00%	10 Missing and 1 partial ⚠️
pkg/template/resource.go	35.29%	6 Missing and 5 partials ⚠️
pkg/metrics/health.go	64.70%	3 Missing and 3 partials ⚠️
pkg/template/template_cache.go	0.00%	3 Missing and 3 partials ⚠️
cmd/confd/config.go	0.00%	1 Missing and 1 partial ⚠️
pkg/metrics/backend.go	94.11%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #416      +/-   ##
==========================================
+ Coverage   61.79%   64.30%   +2.51%     
==========================================
  Files          37       41       +4     
  Lines        3484     3858     +374     
==========================================
+ Hits         2153     2481     +328     
- Misses       1187     1220      +33     
- Partials      144      157      +13

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

claude · 2026-01-14T18:00:23Z

PR #416 Review: Add metrics and observability instrumentation

Summary

This PR implements Phase 1 of #403 by adding comprehensive Prometheus metrics instrumentation to confd. The implementation is well-structured, follows Go best practices, and maintains backward compatibility with zero overhead when disabled.

✅ Strengths

Code Quality & Architecture

Excellent separation of concerns: The pkg/metrics/ package is cleanly isolated with minimal coupling to existing code
Zero-overhead when disabled: The Enabled() guard pattern ensures no performance impact when metrics are not in use
Decorator pattern: The InstrumentedClient wrapper for backends is a textbook example of non-invasive instrumentation
Consistent metric naming: All metrics follow the Prometheus naming convention (confd_<subsystem>_<metric>_<unit>)
Well-documented: Code includes clear comments explaining purpose and behavior

Security & Safety

No security vulnerabilities identified: The metrics server is opt-in and doesn't expose sensitive data
Proper context handling: Timeouts are enforced on health checks (5s)
Graceful degradation: If metrics are disabled mid-operation, no panics occur due to nil checks

Test Coverage

Comprehensive unit tests: 24+ tests covering initialization, metrics recording, health endpoints, and edge cases
Good mock usage: Tests properly isolate components with mock implementations
Edge cases covered: Tests verify behavior with metrics enabled/disabled, backend failures, etc.

Integration

Non-breaking changes: All existing functionality preserved, metrics are purely additive
Proper CLI flag: --metrics-addr with env var support (CONFD_METRICS_ADDR)
Health probes: Both /health (liveness) and /ready (readiness) endpoints provided for k8s integration

🔍 Areas for Consideration

1. Cardinality Concerns (Medium Priority)

Issue: Several metrics use high-cardinality labels that could cause problems in production:

// In command_executor.go:88,138
metrics.CommandDuration.WithLabelValues("check", e.dest).Observe(duration)
metrics.CommandTotal.WithLabelValues("check", e.dest).Inc()

// In file_stager.go:116
metrics.FileSyncTotal.WithLabelValues(destPath).Inc()

// In template processing
TemplateProcessDuration.WithLabelValues(dest).Observe(duration)

Problem: If users have hundreds or thousands of unique destination files, this creates unbounded metric series. In a system managing 1000 config files × 2 command types × 2 metrics = 4000+ time series just for commands.

Recommendation:

Consider bucketing destinations by pattern (e.g., /etc/app/*.conf → etc_app)
Or make high-cardinality labels optional via config flag
Add documentation warning about cardinality in environments with many templates

2. Metrics Server Lifecycle (Low Priority)

Location: cmd/confd/cli.go:383-394

Issue: The metrics HTTP server is started in a goroutine without graceful shutdown handling. When confd exits, the server may not close cleanly.

Recommendation: Add graceful shutdown using the existing context to properly close the server on SIGTERM/SIGINT.

3. Histogram Bucket Selection (Low Priority)

Issue: All histograms use prometheus.DefBuckets which are designed for typical HTTP request durations.

Consideration:

Backend operations might have different latency profiles (especially for slow backends like DynamoDB, SSM)
Command execution (especially reload commands) might take much longer

Recommendation: Consider custom buckets per metric type to better capture actual latency distributions.

4. Error Message Exposure (Security - Low Risk)

Location: pkg/metrics/health.go:31

w.Write([]byte("backend unhealthy: " + err.Error()))

Issue: Error messages might expose internal details (connection strings, paths, etc.).

Recommendation: Log the full error but return a generic message to external callers.

5. Missing Metric Documentation

Issue: The PR description lists available metrics, but there's no in-code documentation for operators.

Recommendation: Add a pkg/metrics/README.md with metric descriptions, example PromQL queries, and alerting rule suggestions.

🎯 Performance Considerations

Positive

✅ Metrics are disabled by default (zero overhead)
✅ Label values are pre-computed (no string formatting in hot path)
✅ Registry is separate from global Prometheus registry (no conflicts)
✅ No blocking operations in metric recording

Potential Concerns

⚠️ High-cardinality labels could cause memory pressure with many destinations
⚠️ No rate limiting on metrics endpoint (low risk, standard Prometheus pattern)

📊 Test Coverage Assessment

Coverage appears excellent:

✅ Initialization and registration tested
✅ Enabled/disabled state transitions tested
✅ All metric types covered (counters, histograms, gauges)
✅ Health endpoints tested for success and failure cases
✅ Wrapper behavior tested (passthrough when disabled)

🔐 Security Assessment

No major security concerns identified:

✅ Metrics endpoint is opt-in and requires explicit configuration
✅ No authentication on metrics endpoint (standard Prometheus pattern - should be firewalled)
✅ No user input in metric labels (all internal values)
⚠️ Minor: Error messages in health endpoint could be more generic

Recommendations:

Document that --metrics-addr should only bind to internal networks
Add note about firewall rules in deployment docs

✅ Recommendations Summary

Should Fix Before Merge

Cardinality warning in documentation: Add a note about high-cardinality labels and their implications
Metrics server graceful shutdown: Implement proper cleanup on exit

Could Fix in Follow-up PR

Custom histogram buckets for different operation types
Template cache size metric
Generic error messages in health endpoint
Comprehensive metrics documentation (README/dashboard)

🎉 Conclusion

This is an excellent implementation of metrics instrumentation. The code quality is high, the design is non-invasive, and the test coverage is comprehensive. The main concern is the high-cardinality labels which could cause issues in large deployments, but this is easily addressable with documentation or configuration options.

Recommendation: ✅ Approve with minor suggestions

The PR is ready to merge after addressing the two "should fix" items (documentation and graceful shutdown). The other suggestions can be addressed in follow-up PRs.

… gosec Add error handling for: - http.ResponseWriter.Write() calls in health and ready endpoints - temp.Close() and os.Remove() calls in file_stager cleanup paths These changes address gosec G104 (CWE-703) warnings about unhandled errors. While these errors are typically non-critical (write failures in HTTP handlers, cleanup failures in error paths), proper error handling improves observability and satisfies security scanning requirements.

…s attacks Configure the metrics HTTP server with a 10-second ReadHeaderTimeout to mitigate potential Slowloris denial-of-service attacks. This addresses gosec G112 (CWE-400) security warning.

Copilot AI review requested due to automatic review settings January 14, 2026 17:57

Copilot AI reviewed Jan 14, 2026

View reviewed changes

github-advanced-security bot found potential problems Jan 14, 2026

View reviewed changes

cmd/confd/cli.go Fixed Show fixed Hide fixed

pkg/metrics/health.go Fixed Show fixed Hide fixed

pkg/metrics/health.go Fixed Show fixed Hide fixed

pkg/metrics/health.go Fixed Show fixed Hide fixed

abtreece added 2 commits January 14, 2026 15:24

fix: Add ReadHeaderTimeout to metrics HTTP server to prevent Slowlori…

a6bd821

…s attacks Configure the metrics HTTP server with a 10-second ReadHeaderTimeout to mitigate potential Slowloris denial-of-service attacks. This addresses gosec G112 (CWE-400) security warning.

abtreece merged commit 77f303f into main Jan 14, 2026
11 checks passed

abtreece deleted the abtreece/issue-403-review branch January 14, 2026 21:37

This was referenced Jan 14, 2026

Add metrics and observability instrumentation #403

Closed

feat: Add templates_loaded and watched_keys gauge metrics #424

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add metrics and observability instrumentation #416

Add metrics and observability instrumentation #416

Uh oh!

abtreece commented Jan 14, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Jan 14, 2026 •

edited

Loading

Uh oh!

claude bot commented Jan 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Add metrics and observability instrumentation #416

Add metrics and observability instrumentation #416

Uh oh!

Conversation

abtreece commented Jan 14, 2026

Summary

Testing

Metrics Available

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Jan 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude bot commented Jan 14, 2026

PR #416 Review: Add metrics and observability instrumentation

Summary

✅ Strengths

Code Quality & Architecture

Security & Safety

Test Coverage

Integration

🔍 Areas for Consideration

1. Cardinality Concerns (Medium Priority)

2. Metrics Server Lifecycle (Low Priority)

3. Histogram Bucket Selection (Low Priority)

4. Error Message Exposure (Security - Low Risk)

5. Missing Metric Documentation

🎯 Performance Considerations

Positive

Potential Concerns

📊 Test Coverage Assessment

🔐 Security Assessment

✅ Recommendations Summary

Should Fix Before Merge

Could Fix in Follow-up PR

🎉 Conclusion

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov bot commented Jan 14, 2026 •

edited

Loading