Building an autonomous remediation platform — from signal to action, with clear auditability.
Our stack has two core services:
- vigil → detects issues, evaluates policy, decides remediation
- chronos → schedules and executes remediation jobs reliably
Flow in one line: Detect → Decide → Schedule → Execute → Verify
Modern systems already generate alerts. The real problem is what happens after alerting:
- too much manual triage
- delayed remediation
- inconsistent execution
- weak traceability between incident and action
We are building a control plane where:
- vigil observes + evaluates policy
- if remediation is needed, vigil triggers a tenant-aware, idempotent job in chronos
- chronos handles durable scheduling, retries, worker execution, and outcome state
- vigil syncs execution status back into incident/action timelines for operators
Result: fewer manual loops, faster MTTR, and better operational confidence.
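The "tenant-aware, idempotent job" trigger can be sketched as follows. This is a minimal illustration, not the actual vigil client: the function names, the payload field names, and the choice of SHA-256 over tenant, incident, and action are all assumptions made for the example.

```python
import hashlib


def idempotency_key(tenant_id: str, incident_id: str, action: str) -> str:
    """Derive a stable key so repeated triggers for the same remediation collide.

    Hypothetical scheme: hash the (tenant, incident, action) triple.
    """
    # Use a separator that cannot appear in normal identifiers to avoid ambiguity.
    raw = "\x1f".join([tenant_id, incident_id, action])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def build_job_request(tenant_id: str, incident_id: str, action: str, trace_id: str) -> dict:
    """Assemble a hypothetical job payload that vigil would POST to chronos."""
    return {
        "tenant_id": tenant_id,
        "trace_id": trace_id,
        "idempotency_key": idempotency_key(tenant_id, incident_id, action),
        "incident_id": incident_id,
        "action": action,
    }
```

Because the key depends only on the tenant, incident, and action, a retried trigger with a new `trace_id` still carries the same `idempotency_key`, which lets the receiving side recognize it as a duplicate.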
- ✅ Core scheduler/worker architecture in place
- ✅ API + execution state model available
- ✅ Retry/dead-letter/recovery paths implemented (MVP/hardening level)
- ✅ Observability + deployment scaffolding (compose/k8s baseline)
- ⚠️ Some components are still being hardened from MVP to production-grade internals
- ✅ Direction and integration strategy locked
- ✅ Initial integration implementation work underway
- 🚧 Correlation + sync hardening in progress
- 🚧 Full end-to-end contract validation and deeper integration test coverage are in progress
flowchart LR
A[Signals / Metrics / Events] --> B[vigil: Policy Engine]
B -->|Remediation decision| C[vigil: Remediation Client]
C -->|POST job + idempotency key| D[chronos API]
D --> E[Scheduler]
E --> F[Queue]
F --> G[Workers]
G --> H[Execution Result + State]
H --> I[vigil Status Sync]
I --> J[Incident / Action Timeline]
sequenceDiagram
participant V as vigil
participant C as chronos
V->>C: POST /jobs (tenant, trace_id, idempotency_key)
C-->>V: 202 Accepted (job_id, execution_id)
V->>V: store correlation fields
loop until terminal state
V->>C: GET execution status
C-->>V: state + attempts + timestamps
V->>V: map chronos state -> incident/action state
end
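The polling loop in the sequence above could look roughly like this. It is a sketch under stated assumptions: the chronos state names, the vigil action states, and the `fetch_status` and `record` callbacks are hypothetical stand-ins for the real status endpoint and timeline writer.

```python
import time
from typing import Callable

# Hypothetical chronos execution states mapped to hypothetical vigil action states.
STATE_MAP = {
    "queued": "pending",
    "running": "in_progress",
    "succeeded": "resolved",
    "failed": "failed",
    "dead_lettered": "needs_manual_intervention",
}
TERMINAL = {"succeeded", "failed", "dead_lettered"}


def map_state(chronos_state: str) -> str:
    """Deterministic mapping; unknown states raise instead of being guessed."""
    try:
        return STATE_MAP[chronos_state]
    except KeyError:
        raise ValueError(f"unmapped chronos state: {chronos_state!r}") from None


def sync_until_terminal(
    fetch_status: Callable[[], dict],
    record: Callable[[str, dict], None],
    interval_s: float = 1.0,
    max_polls: int = 60,
) -> str:
    """Poll execution status, record each mapped state, stop at a terminal state."""
    for _ in range(max_polls):
        status = fetch_status()
        action_state = map_state(status["state"])
        record(action_state, status)
        if status["state"] in TERMINAL:
            return action_state
        time.sleep(interval_s)
    raise TimeoutError("execution did not reach a terminal state")
```

Bounding the loop with `max_polls` keeps a stuck execution from holding the sync worker forever; the caller decides what to do on timeout.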
- Idempotency first: duplicate triggers must not create duplicate effective remediation
- Tenant isolation: cross-tenant execution is rejected by design
- Auditability: every decision and execution must be traceable
- Fail-safe retries: bounded retries + dead-letter for manual intervention paths
- Operational clarity: status mapping must be deterministic across both services
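Three of these principles (idempotency, tenant isolation, and bounded retries with dead-lettering) can be illustrated with a small in-memory sketch. Nothing here reflects the real chronos internals: `JobStore`, `TenantMismatch`, `run_with_retries`, and the state names are hypothetical and exist only to show the intended behavior.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable


class TenantMismatch(Exception):
    """Raised when a tenant tries to access another tenant's job."""


@dataclass
class Job:
    job_id: str
    tenant_id: str
    idempotency_key: str
    state: str = "queued"
    attempts: int = 0


@dataclass
class JobStore:
    _by_key: dict = field(default_factory=dict)
    _by_id: dict = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def submit(self, tenant_id: str, idempotency_key: str) -> Job:
        """Return the existing job for a repeated (tenant, key) pair."""
        key = (tenant_id, idempotency_key)
        if key in self._by_key:
            return self._by_key[key]
        job = Job(f"job-{next(self._ids)}", tenant_id, idempotency_key)
        self._by_key[key] = job
        self._by_id[job.job_id] = job
        return job

    def get(self, tenant_id: str, job_id: str) -> Job:
        """Reject lookups that cross a tenant boundary."""
        job = self._by_id[job_id]
        if job.tenant_id != tenant_id:
            raise TenantMismatch(job_id)
        return job


def run_with_retries(job: Job, task: Callable[[], None], max_attempts: int = 3) -> Job:
    """Run the task up to max_attempts times, then dead-letter the job."""
    while job.attempts < max_attempts:
        job.attempts += 1
        try:
            task()
        except Exception:
            continue  # bounded retry: try again until attempts are exhausted
        job.state = "succeeded"
        return job
    job.state = "dead_lettered"  # left for manual intervention
    return job
```

Scoping the deduplication key to `(tenant_id, idempotency_key)` means two tenants can reuse the same key string without colliding, while a duplicate trigger within one tenant returns the original job.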
- vigil remains the decision/control plane
- chronos remains the execution engine
- No monolith merge; strict API contract between services
- chronos execution state is persisted durably
- queue is transient transport
- Redis/coordination is advisory, an optimization where applicable
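One consequence of treating the queue as transient is that it should be possible to rebuild it from durable state after a restart. The sketch below illustrates that idea only; it assumes a hypothetical `executions` table and state names, and uses SQLite and an in-process queue as stand-ins for whatever store and broker chronos actually uses.

```python
import queue
import sqlite3

# Hypothetical terminal states; anything else is considered still in flight.
TERMINAL = ("succeeded", "failed", "dead_lettered")


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the durable store and make sure the executions table exists."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS executions ("
        "execution_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return db


def rebuild_queue(db: sqlite3.Connection) -> queue.Queue:
    """Re-enqueue every non-terminal execution found in the durable store."""
    q: queue.Queue = queue.Queue()
    placeholders = ",".join("?" for _ in TERMINAL)
    rows = db.execute(
        f"SELECT execution_id FROM executions "
        f"WHERE state NOT IN ({placeholders}) ORDER BY execution_id",
        TERMINAL,
    )
    for (execution_id,) in rows:
        q.put(execution_id)
    return q
```

In this model, losing the queue costs only delivery latency: the durable table remains the source of truth for what still needs to run.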
- finalize integration contract v1 (payloads, headers, errors)
- complete correlation persistence + status sync loop hardening
- expand integration tests: duplicates, retries, failures, dead-letter, tenant isolation
- operator-facing integration health panels
- tighter SLO-driven alerts
- production hardening pass on remaining MVP internals
- unified operator UX layer across vigil + chronos
- broader remediation playbooks and policy templates
- chronos — scheduling, dispatch, execution engine
- vigil — monitoring, policy, remediation control plane
If you are contributing, prioritize practical changes with testable outcomes and clear operator impact.