feat: add coordinator and worker health endpoints #1802

Merged
yottahmd merged 3 commits into main from worker-healthport
Mar 19, 2026
Conversation

Collaborator

@yottahmd yottahmd commented Mar 19, 2026

Summary

  • add native HTTP /health servers for coordinator and worker using a shared healthcheck server
  • add config, CLI, schema, Helm, and test coverage for dedicated health ports
  • keep coordinator gRPC health unchanged and keep start-all from exposing the dedicated coordinator health port

Testing

  • go test ./internal/cmn/config ./internal/cmn/schema -count=1
  • go test ./internal/service/healthcheck ./internal/service/scheduler ./internal/service/coordinator ./internal/service/worker -count=1
  • go test ./internal/cmd -count=1

Closes #1788

Summary by CodeRabbit

Release Notes

  • New Features

    • Added HTTP health check servers for Coordinator (port 8091) and Worker (port 8092) with configurable ports via environment variables (DAGU_COORDINATOR_HEALTH_PORT, DAGU_WORKER_HEALTH_PORT) and CLI flags. Set port to 0 to disable health checks.
  • Documentation

    • Updated README and Helm chart documentation to reference new HTTP health check endpoints.


coderabbitai Bot commented Mar 19, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2f91e41c-6f37-4b5a-927b-60de81be8f33

Walkthrough

This PR introduces HTTP health check servers for the Coordinator and Worker services with configurable ports (defaults 8091 and 8092, respectively); setting a port to 0 disables that health server. The new HTTP /health endpoint returns a JSON status, and the Helm liveness/readiness probes switch from TCP socket checks to HTTP GET /health. The PR also includes configuration, CLI flags, service lifecycle integration, Kubernetes deployment updates, and test coverage.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Configuration: README.md, charts/dagu/README.md, internal/cmn/schema/config.schema.json | Added environment variables and schema definitions for health port settings with port number and disable-with-zero semantics. |
| CLI Flags & Command Configuration: internal/cmd/flags.go, internal/cmd/coord.go, internal/cmd/coord_test.go, internal/cmd/worker.go, internal/cmd/worker_test.go, internal/cmd/startall.go | Introduced --coordinator.health-port and --worker.health-port CLI flags with defaults (8091/8092), integrated flag parsing into service initialization, and added flag validation tests. Coordinator disables health server in start-all in-process mode. |
| Configuration Structs & Loading: internal/cmn/config/config.go, internal/cmn/config/config_test.go, internal/cmn/config/definition.go, internal/cmn/config/loader.go, internal/cmn/config/loader_test.go | Added HealthPort field to Coordinator and Worker structs with validation (0–65535 range), implemented port defaulting logic (8091/8092), added environment variable bindings, and added comprehensive loader/validation tests. |
| HTTP Healthcheck Server Implementation: internal/service/healthcheck/server.go, internal/service/healthcheck/server_test.go | New shared HTTP healthcheck server package providing /health endpoint returning JSON {"status":"healthy"}, with graceful start/stop lifecycle management and ability to disable via port 0. |
| Coordinator Service Integration: internal/service/coordinator/service.go, internal/service/coordinator/health_test.go | Updated coordinator to manage both gRPC and HTTP health servers, added DisableHealthServer() method, integrated health server startup/cleanup on failures, and added integration tests verifying health endpoint and cleanup behavior. |
| Worker Service Integration: internal/service/worker/worker.go, internal/service/worker/health_test.go | Added health server field to Worker, integrated HTTP health server lifecycle into start/stop methods with error propagation, and added tests verifying health endpoint availability and resilience during heartbeat failures. |
| Scheduler Health Server Refactoring: internal/service/scheduler/health.go | Replaced local health server implementation with aliases to shared healthcheck package, removing duplicate code (~150 lines) while delegating to centralized healthcheck server. |
| Helm & Kubernetes: charts/dagu/values.yaml, charts/dagu/templates/configmap.yaml, charts/dagu/templates/coordinator-deployment.yaml, charts/dagu/templates/worker-deployment.yaml | Added health port configuration values, exposed ports in deployments, switched liveness/readiness probes from TCP socket checks to HTTP GET /health endpoints. |
| Test Infrastructure: internal/test/coordinator.go, internal/test/helper.go | Updated test helpers to disable health servers by default (port 0) and configured generated test YAML with health port fields. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as External Client<br/>(Probe/Monitor)
    participant Svc as Coordinator/<br/>Worker Service
    participant HTTP as HTTP Health<br/>Server
    participant gRPC as gRPC Health<br/>Server

    rect rgba(100, 150, 200, 0.5)
    Note over Client,gRPC: Health Check Flow
    Client->>HTTP: GET /health
    HTTP->>Svc: Check service status
    Svc->>gRPC: Query gRPC health state
    gRPC-->>Svc: SERVING status
    Svc-->>HTTP: Healthy status
    HTTP-->>Client: 200 OK {"status":"healthy"}
    end

    rect rgba(150, 100, 150, 0.5)
    Note over Svc,HTTP: Lifecycle Management
    Svc->>HTTP: Start (on service startup)
    HTTP->>HTTP: Listen on port
    HTTP-->>Svc: Ready
    
    Svc->>gRPC: Set SERVING
    
    Note over Svc,HTTP: ... service running ...
    
    Svc->>HTTP: Stop (on service shutdown)
    HTTP->>HTTP: Graceful shutdown
    HTTP-->>Svc: Stopped
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • #1584: Modifies worker service lifecycle methods (Start/Stop) and struct initialization, sharing overlapping code changes with this PR's worker integration.
  • #1564: Updates coordinator service construction and NewService wiring, directly related to this PR's coordinator health server integration.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: add coordinator and worker health endpoints' clearly summarizes the main change: adding HTTP health endpoints for both coordinator and worker services. |
| Linked Issues check | ✅ Passed | The PR implements the core requirement from #1788: native HTTP health endpoints for coordinator and worker services that can be checked individually, with configurable ports and the ability to disable via setting to 0. |
| Out of Scope Changes check | ✅ Passed | All changes align with the health endpoint feature: configuration structures, CLI flags, Helm charts, new healthcheck server implementation, service integration, tests, and documentation updates are all necessary for the feature. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
internal/service/healthcheck/server.go (1)

72-125: Consider using errors.Is() for error comparison.

Line 116 compares errors directly with !=. While this works for http.ErrServerClosed, using errors.Is() is the idiomatic Go approach that handles wrapped errors correctly.

♻️ Suggested improvement
-		if err := server.Serve(listener); err != nil && err != http.ErrServerClosed {
+		if err := server.Serve(listener); err != nil && !errors.Is(err, http.ErrServerClosed) {

You'll also need to add "errors" to the imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/service/healthcheck/server.go` around lines 72 - 125, The health
server goroutine in Start uses a direct error comparison (err !=
http.ErrServerClosed) which fails for wrapped errors; update the check inside
the anonymous goroutine in Start (referencing server.Serve) to use
errors.Is(err, http.ErrServerClosed) instead of !=, and add "errors" to the
imports so the code compiles and correctly handles wrapped errors when logging
in the Serve error branch.
internal/service/worker/worker.go (1)

214-219: Consider stopping the health server earlier in the shutdown sequence.

Currently, the health server is stopped last, after the PostgreSQL pool manager and coordinator client cleanup. For Kubernetes deployments, stopping the health server first would cause probes to fail immediately, signaling that the pod is no longer ready to receive traffic while other cleanup proceeds.

This is a minor operational consideration and not a blocking issue.

🔧 Suggested reordering (optional)
 func (w *Worker) Stop(ctx context.Context) error {
 	var err error
 	w.stopOnce.Do(func() {
 		logger.Info(ctx, "Worker stopping", tag.WorkerID(w.id))
 
+		// Stop health server first to signal unavailability to probes
+		if w.healthServer != nil {
+			if stopErr := w.healthServer.Stop(ctx); stopErr != nil && err == nil {
+				err = fmt.Errorf("failed to stop worker health check server: %w", stopErr)
+			}
+		}
+
 		// Cancel the internal context to signal all goroutines to stop
 		if w.stopCancel != nil {
 			w.stopCancel()
 		}
 		// ... rest of cleanup ...
-
-		if w.healthServer != nil {
-			if stopErr := w.healthServer.Stop(ctx); stopErr != nil && err == nil {
-				err = fmt.Errorf("failed to stop worker health check server: %w", stopErr)
-			}
-		}
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/service/worker/worker.go` around lines 214 - 219, Move the
health-server stop earlier in the worker shutdown sequence so readiness probes
fail immediately: in the worker shutdown function (the method that calls
w.healthServer.Stop currently after Postgres pool and coordinator cleanup),
invoke w.healthServer.Stop(ctx) near the start of the teardown (before cleaning
up the Postgres pool manager and coordinator client) and keep the existing error
handling (assigning to err when stopErr != nil && err == nil) to preserve
semantics.
internal/service/coordinator/service.go (1)

69-72: Thread-safety consideration for DisableHealthServer().

This method sets a flag without synchronization. The current usage in startall.go (called immediately after construction, before Start()) is safe. However, if this method were ever called concurrently with Start(), it would create a data race on disableHealthServer.

Consider documenting that this must be called before Start(), or adding mutex protection for defensive coding:

🛡️ Optional: Add documentation or mutex protection

Option 1: Document the constraint

 // DisableHealthServer disables the dedicated HTTP health check server.
+// Must be called before Start().
 func (srv *Service) DisableHealthServer() {
 	srv.disableHealthServer = true
 }

Option 2: Add mutex protection

 func (srv *Service) DisableHealthServer() {
+	srv.mu.Lock()
+	defer srv.mu.Unlock()
 	srv.disableHealthServer = true
 }
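
Option 2 only removes the race if the code that reads the flag takes the same lock. The following is a minimal, self-contained sketch of that pairing; the `Service` type here is a simplified stand-in with hypothetical fields, not the coordinator's real struct, and `Start` is reduced to the flag check.

```go
package main

import "sync"

// Service is a simplified, hypothetical stand-in for the coordinator service.
type Service struct {
	mu                  sync.Mutex
	disableHealthServer bool
	healthStarted       bool
}

// DisableHealthServer sets the flag under the mutex.
func (s *Service) DisableHealthServer() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.disableHealthServer = true
}

// Start reads the flag under the same mutex, so a concurrent call to
// DisableHealthServer is either fully visible or not yet applied.
// It reports whether the health server would be started.
func (s *Service) Start() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.disableHealthServer {
		return false
	}
	s.healthStarted = true // placeholder for starting the HTTP health server
	return true
}
```

Locking makes the access race-free but does not make the ordering meaningful: a `DisableHealthServer` call that arrives after `Start` has already run still has no effect, which is why documenting the "call before Start()" constraint remains useful either way.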
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/service/coordinator/service.go` around lines 69 - 72, The
DisableHealthServer method mutates srv.disableHealthServer without
synchronization which can race with Start; either document the usage constraint
or make the setter thread-safe. Locate the Service type and the
DisableHealthServer method and either add a comment on DisableHealthServer
stating it must be called before Start() (and reference Start in the comment),
or protect access to disableHealthServer with the Service's mutex (e.g.,
lock/unlock around reads/writes where Start reads it and in DisableHealthServer)
so concurrent calls are safe; ensure both setter and any readers (Start) use the
same synchronization primitive.
internal/service/worker/health_test.go (1)

84-134: Well-designed test helpers.

The polling-based helpers using require.Eventually are appropriate for testing async server startup/shutdown. The timeouts (5s) and poll intervals (10ms) are reasonable.

Note: These helpers are duplicated in internal/service/coordinator/health_test.go. If the test suite grows, consider extracting them to a shared test utility package, but this is fine for now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/service/worker/health_test.go` around lines 84 - 134, The three
duplicated test helper functions (requireWorkerHealthServerURL,
requireHealthyWorkerHealth, requireWorkerHealthStopped) should be extracted from
internal/service/worker/health_test.go into a shared test utility package (e.g.,
internal/testutil or internal/test/helpers); move the functions there preserving
their signatures and behavior, update internal/service/worker/health_test.go and
internal/service/coordinator/health_test.go to import that package and call the
helpers, and ensure any test imports (testing, healthcheck, http, time, json,
require) are available or re-exported as needed so the tests compile and run
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@charts/dagu/templates/coordinator-deployment.yaml`:
- Around line 43-45: The template is rendering a containerPort and HTTP probes
even when coordinator.healthPort is 0 (which disables the health server) causing
invalid manifests; update coordinator-deployment.yaml to conditionally render
the health containerPort block and the livenessProbe/readinessProbe httpGet
entries only when .Values.coordinator.healthPort is non‑zero (e.g. guard the "-
name: health / containerPort: {{ .Values.coordinator.healthPort }}" block and
the livenessProbe/readinessProbe httpGet path/port with an if check on
.Values.coordinator.healthPort), and apply the same conditional guarding in
worker-deployment.yaml for the worker health port and its probes so ports/probes
are omitted when healthPort == 0.

In `@charts/dagu/templates/worker-deployment.yaml`:
- Around line 51-54: The template currently always renders the health
containerPort and the /health probes even when worker.healthPort is set to 0;
update the Helm template in worker-deployment.yaml so the health port entry (the
ports: - name: health containerPort block) and the liveness/readiness probe
blocks are conditionally rendered only when $.Values.worker.healthPort != 0
(i.e., skip emitting the health port and the probe definitions when
worker.healthPort is 0), ensuring the container spec and Probe fields are
omitted when the health server is disabled.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2bc6e523-b342-4fa2-9474-e5d451953cd7

📥 Commits

Reviewing files that changed from the base of the PR and between 182e223 and baf245f.

📒 Files selected for processing (27)
  • README.md
  • charts/dagu/README.md
  • charts/dagu/templates/configmap.yaml
  • charts/dagu/templates/coordinator-deployment.yaml
  • charts/dagu/templates/worker-deployment.yaml
  • charts/dagu/values.yaml
  • internal/cmd/coord.go
  • internal/cmd/coord_test.go
  • internal/cmd/flags.go
  • internal/cmd/startall.go
  • internal/cmd/worker.go
  • internal/cmd/worker_test.go
  • internal/cmn/config/config.go
  • internal/cmn/config/config_test.go
  • internal/cmn/config/definition.go
  • internal/cmn/config/loader.go
  • internal/cmn/config/loader_test.go
  • internal/cmn/schema/config.schema.json
  • internal/service/coordinator/health_test.go
  • internal/service/coordinator/service.go
  • internal/service/healthcheck/server.go
  • internal/service/healthcheck/server_test.go
  • internal/service/scheduler/health.go
  • internal/service/worker/health_test.go
  • internal/service/worker/worker.go
  • internal/test/coordinator.go
  • internal/test/helper.go

@yottahmd yottahmd merged commit 38be2a2 into main Mar 19, 2026
6 checks passed
@yottahmd yottahmd deleted the worker-healthport branch March 19, 2026 09:25

codecov Bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 79.12621% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.21%. Comparing base (63ed05d) to head (78f7588).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/service/healthcheck/server.go | 74.00% | 19 Missing and 7 partials ⚠️ |
| internal/service/worker/worker.go | 47.61% | 9 Missing and 2 partials ⚠️ |
| internal/service/coordinator/service.go | 90.69% | 2 Missing and 2 partials ⚠️ |
| internal/cmn/config/config.go | 87.50% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1802      +/-   ##
==========================================
+ Coverage   69.02%   69.21%   +0.19%     
==========================================
  Files         424      425       +1     
  Lines       51230    51394     +164     
==========================================
+ Hits        35361    35574     +213     
+ Misses      12853    12802      -51     
- Partials     3016     3018       +2     

| Files with missing lines | Coverage Δ |
| --- | --- |
| internal/cmd/coord.go | `74.59% <100.00%> (ø)` |
| internal/cmd/flags.go | `100.00% <ø> (ø)` |
| internal/cmd/startall.go | `48.63% <100.00%> (-0.56%)` ⬇️ |
| internal/cmd/worker.go | `53.06% <100.00%> (+0.97%)` ⬆️ |
| internal/cmn/config/loader.go | `81.66% <100.00%> (+0.41%)` ⬆️ |
| internal/service/scheduler/health.go | `100.00% <100.00%> (+22.07%)` ⬆️ |
| internal/cmn/config/config.go | `73.80% <87.50%> (+1.12%)` ⬆️ |
| internal/service/coordinator/service.go | `85.56% <90.69%> (+3.21%)` ⬆️ |
| internal/service/worker/worker.go | `76.53% <47.61%> (+2.09%)` ⬆️ |
| internal/service/healthcheck/server.go | `74.00% <74.00%> (ø)` |

... and 13 files with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Development

Successfully merging this pull request may close these issues.

Feature request: provide a native health-check for the coordinator and worker services

1 participant