Skip to content

Conversation

@abtreece
Copy link
Owner

Summary

Implements Phase 1 of #403: adds Prometheus metrics instrumentation for production observability. Includes optional --metrics-addr flag to expose metrics endpoint, plus /health and /ready probes. Instruments backend operations, template processing, command execution, and file sync with latency histograms and success/error counters.

Testing

  • All 24 metrics tests pass
  • All existing tests pass (no breaking changes)
  • Build succeeds without errors
  • Metrics disabled by default (zero overhead)

Metrics Available

Backend: confd_backend_request_duration_seconds, confd_backend_request_total, confd_backend_errors_total, confd_backend_healthy

Template: confd_template_process_duration_seconds, confd_template_process_total, confd_template_cache_hits_total, confd_template_cache_misses_total

Commands: confd_command_duration_seconds, confd_command_total, confd_command_exit_code_total

Files: confd_file_sync_total, confd_file_changed_total

Implement Phase 1 of issue #403 with Prometheus metrics for production monitoring. Adds optional --metrics-addr flag to expose metrics endpoint (/metrics), health checks (/health, /ready), and instruments backend operations, template processing, command execution, and file sync operations with duration histograms and success/error counters. All metrics are optional and disabled by default.
Copilot AI review requested due to automatic review settings January 14, 2026 17:57
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements Phase 1 of observability instrumentation by adding Prometheus metrics support to confd. The changes enable optional metrics collection through a --metrics-addr flag, exposing an HTTP endpoint with Prometheus metrics, health checks, and readiness probes. The implementation instruments backend operations, template processing, command execution, and file synchronization with latency histograms and counters.

Changes:

  • Added comprehensive Prometheus metrics instrumentation across backend, template, command, and file operations
  • Implemented HTTP endpoints for metrics (/metrics), liveness (/health), and readiness (/ready) probes
  • Integrated metrics configuration through CLI flag and TOML config file

Reviewed changes

Copilot reviewed 13 out of 235 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pkg/metrics/metrics.go Defines all Prometheus metrics and initialization logic
pkg/metrics/backend.go Implements instrumented wrapper for backend store clients
pkg/metrics/health.go Provides HTTP handlers for health and readiness probes
pkg/template/resource.go Adds metrics tracking to template processing operations
pkg/template/command_executor.go Instruments check and reload command execution with metrics
pkg/template/template_cache.go Tracks template cache hits and misses
pkg/template/file_stager.go Records file sync operations
cmd/confd/cli.go Integrates metrics server startup and client wrapping
cmd/confd/config.go Adds metrics_addr configuration option
go.mod Adds Prometheus client library dependency
pkg/metrics/metrics_test.go Tests for metrics initialization and recording
pkg/metrics/backend_test.go Tests for backend instrumentation wrapper
pkg/metrics/health_test.go Tests for health and readiness handlers

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov
Copy link

codecov bot commented Jan 14, 2026

Codecov Report

❌ Patch coverage is 72.40000% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.30%. Comparing base (4bbca52) to head (a6bd821).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
pkg/template/command_executor.go 36.00% 14 Missing and 2 partials ⚠️
cmd/confd/cli.go 0.00% 15 Missing ⚠️
pkg/template/file_stager.go 0.00% 10 Missing and 1 partial ⚠️
pkg/template/resource.go 35.29% 6 Missing and 5 partials ⚠️
pkg/metrics/health.go 64.70% 3 Missing and 3 partials ⚠️
pkg/template/template_cache.go 0.00% 3 Missing and 3 partials ⚠️
cmd/confd/config.go 0.00% 1 Missing and 1 partial ⚠️
pkg/metrics/backend.go 94.11% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #416      +/-   ##
==========================================
+ Coverage   61.79%   64.30%   +2.51%     
==========================================
  Files          37       41       +4     
  Lines        3484     3858     +374     
==========================================
+ Hits         2153     2481     +328     
- Misses       1187     1220      +33     
- Partials      144      157      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@claude
Copy link

claude bot commented Jan 14, 2026

PR #416 Review: Add metrics and observability instrumentation

Summary

This PR implements Phase 1 of #403 by adding comprehensive Prometheus metrics instrumentation to confd. The implementation is well-structured, follows Go best practices, and maintains backward compatibility with zero overhead when disabled.

✅ Strengths

Code Quality & Architecture

  • Excellent separation of concerns: The pkg/metrics/ package is cleanly isolated with minimal coupling to existing code
  • Zero-overhead when disabled: The Enabled() guard pattern ensures no performance impact when metrics are not in use
  • Decorator pattern: The InstrumentedClient wrapper for backends is a textbook example of non-invasive instrumentation
  • Consistent metric naming: All metrics follow the Prometheus naming convention (confd_<subsystem>_<metric>_<unit>)
  • Well-documented: Code includes clear comments explaining purpose and behavior

Security & Safety

  • No security vulnerabilities identified: The metrics server is opt-in and doesn't expose sensitive data
  • Proper context handling: Timeouts are enforced on health checks (5s)
  • Graceful degradation: If metrics are disabled mid-operation, no panics occur due to nil checks

Test Coverage

  • Comprehensive unit tests: 24+ tests covering initialization, metrics recording, health endpoints, and edge cases
  • Good mock usage: Tests properly isolate components with mock implementations
  • Edge cases covered: Tests verify behavior with metrics enabled/disabled, backend failures, etc.

Integration

  • Non-breaking changes: All existing functionality preserved, metrics are purely additive
  • Proper CLI flag: --metrics-addr with env var support (CONFD_METRICS_ADDR)
  • Health probes: Both /health (liveness) and /ready (readiness) endpoints provided for k8s integration

🔍 Areas for Consideration

1. Cardinality Concerns (Medium Priority)

Issue: Several metrics use high-cardinality labels that could cause problems in production:

// In command_executor.go:88,138
metrics.CommandDuration.WithLabelValues("check", e.dest).Observe(duration)
metrics.CommandTotal.WithLabelValues("check", e.dest).Inc()

// In file_stager.go:116
metrics.FileSyncTotal.WithLabelValues(destPath).Inc()

// In template processing
TemplateProcessDuration.WithLabelValues(dest).Observe(duration)

Problem: If users have hundreds or thousands of unique destination files, this creates unbounded metric series. In a system managing 1000 config files × 2 command types × 2 metrics = 4000+ time series just for commands.

Recommendation:

  • Consider bucketing destinations by pattern (e.g., /etc/app/*.confetc_app)
  • Or make high-cardinality labels optional via config flag
  • Add documentation warning about cardinality in environments with many templates

2. Metrics Server Lifecycle (Low Priority)

Location: cmd/confd/cli.go:383-394

Issue: The metrics HTTP server is started in a goroutine without graceful shutdown handling. When confd exits, the server may not close cleanly.

Recommendation: Add graceful shutdown using the existing context to properly close the server on SIGTERM/SIGINT.

3. Histogram Bucket Selection (Low Priority)

Issue: All histograms use prometheus.DefBuckets which are designed for typical HTTP request durations.

Consideration:

  • Backend operations might have different latency profiles (especially for slow backends like DynamoDB, SSM)
  • Command execution (especially reload commands) might take much longer

Recommendation: Consider custom buckets per metric type to better capture actual latency distributions.

4. Error Message Exposure (Security - Low Risk)

Location: pkg/metrics/health.go:31

w.Write([]byte("backend unhealthy: " + err.Error()))

Issue: Error messages might expose internal details (connection strings, paths, etc.).

Recommendation: Log the full error but return a generic message to external callers.

5. Missing Metric Documentation

Issue: The PR description lists available metrics, but there's no in-code documentation for operators.

Recommendation: Add a pkg/metrics/README.md with metric descriptions, example PromQL queries, and alerting rule suggestions.

🎯 Performance Considerations

Positive

  • ✅ Metrics are disabled by default (zero overhead)
  • ✅ Label values are pre-computed (no string formatting in hot path)
  • ✅ Registry is separate from global Prometheus registry (no conflicts)
  • ✅ No blocking operations in metric recording

Potential Concerns

  • ⚠️ High-cardinality labels could cause memory pressure with many destinations
  • ⚠️ No rate limiting on metrics endpoint (low risk, standard Prometheus pattern)

📊 Test Coverage Assessment

Coverage appears excellent:

  • ✅ Initialization and registration tested
  • ✅ Enabled/disabled state transitions tested
  • ✅ All metric types covered (counters, histograms, gauges)
  • ✅ Health endpoints tested for success and failure cases
  • ✅ Wrapper behavior tested (passthrough when disabled)

🔐 Security Assessment

No major security concerns identified:

  • ✅ Metrics endpoint is opt-in and requires explicit configuration
  • ✅ No authentication on metrics endpoint (standard Prometheus pattern - should be firewalled)
  • ✅ No user input in metric labels (all internal values)
  • ⚠️ Minor: Error messages in health endpoint could be more generic

Recommendations:

  • Document that --metrics-addr should only bind to internal networks
  • Add note about firewall rules in deployment docs

✅ Recommendations Summary

Should Fix Before Merge

  1. Cardinality warning in documentation: Add a note about high-cardinality labels and their implications
  2. Metrics server graceful shutdown: Implement proper cleanup on exit

Could Fix in Follow-up PR

  1. Custom histogram buckets for different operation types
  2. Template cache size metric
  3. Generic error messages in health endpoint
  4. Comprehensive metrics documentation (README/dashboard)

🎉 Conclusion

This is an excellent implementation of metrics instrumentation. The code quality is high, the design is non-invasive, and the test coverage is comprehensive. The main concern is the high-cardinality labels which could cause issues in large deployments, but this is easily addressable with documentation or configuration options.

Recommendation: ✅ Approve with minor suggestions

The PR is ready to merge after addressing the two "should fix" items (documentation and graceful shutdown). The other suggestions can be addressed in follow-up PRs.

… gosec

Add error handling for:
- http.ResponseWriter.Write() calls in health and ready endpoints
- temp.Close() and os.Remove() calls in file_stager cleanup paths

These changes address gosec G104 (CWE-703) warnings about unhandled errors.
While these errors are typically non-critical (write failures in HTTP handlers,
cleanup failures in error paths), proper error handling improves observability
and satisfies security scanning requirements.
…s attacks

Configure the metrics HTTP server with a 10-second ReadHeaderTimeout to
mitigate potential Slowloris denial-of-service attacks. This addresses
gosec G112 (CWE-400) security warning.
@abtreece abtreece merged commit 77f303f into main Jan 14, 2026
11 checks passed
@abtreece abtreece deleted the abtreece/issue-403-review branch January 14, 2026 21:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant