logcat

fixitdaddy

Building an autonomous remediation platform — from signal to action, with clear auditability.

Our stack has two core services:

vigil → detects issues, evaluates policy, decides remediation
chronos → schedules and executes remediation jobs reliably

Flow in one line: Detect → Decide → Schedule → Execute → Verify

What we are building (technical story)

Modern systems already generate alerts. The real problem is what happens after alerting:

too much manual triage
delayed remediation
inconsistent execution
weak traceability between incident and action

We are building a control plane where:

vigil observes + evaluates policy
if remediation is needed, vigil triggers a tenant-aware, idempotent job in chronos
chronos handles durable scheduling, retries, worker execution, and outcome state
vigil syncs execution status back into incident/action timelines for operators
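As a rough illustration of what "tenant-aware, idempotent" could mean in practice, the Python sketch below derives a deterministic idempotency key from a remediation decision and lets an in-memory stand-in for chronos return the existing job when the same key arrives twice. The names `make_idempotency_key`, `FakeChronos`, and `submit_job`, and the choice of key fields, are assumptions for illustration only and not the actual vigil or chronos API.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Dict, Tuple


def make_idempotency_key(tenant: str, incident_id: str, action: str) -> str:
    """Derive a stable key so repeats of the same decision collide on purpose."""
    raw = f"{tenant}:{incident_id}:{action}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


@dataclass
class FakeChronos:
    """In-memory stand-in for the chronos job API (hypothetical)."""

    # Maps (tenant, idempotency_key) -> job_id, so keys are scoped per tenant.
    jobs: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def submit_job(self, tenant: str, idempotency_key: str) -> Tuple[str, bool]:
        """Return (job_id, created); a duplicate key returns the existing job."""
        scope = (tenant, idempotency_key)
        if scope in self.jobs:
            return self.jobs[scope], False
        job_id = str(uuid.uuid4())
        self.jobs[scope] = job_id
        return job_id, True


chronos = FakeChronos()
key = make_idempotency_key("tenant-a", "inc-42", "restart-service")
first = chronos.submit_job("tenant-a", key)
second = chronos.submit_job("tenant-a", key)  # duplicate trigger, same job comes back
```

Scoping the lookup by tenant as well as by key means two tenants that happen to produce the same key never share a job.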

Result: fewer manual loops, faster MTTR, and better operational confidence.

Current status

chronos

✅ Core scheduler/worker architecture in place
✅ API + execution state model available
✅ Retry/dead-letter/recovery paths implemented (MVP/hardening level)
✅ Observability + deployment scaffolding (compose/k8s baseline)
⚠️ Some components are still being hardened from MVP to production-grade internals

vigil ↔ chronos integration

✅ Direction and integration strategy locked
✅ Initial integration implementation work underway
🚧 Correlation + sync hardening in progress
🚧 Full end-to-end contract validation and deeper integration tests are in progress

System design (high level)

flowchart LR
    A[Signals / Metrics / Events] --> B[vigil: Policy Engine]
    B -->|Remediation decision| C[vigil: Remediation Client]
    C -->|POST job + idempotency key| D[chronos API]
    D --> E[Scheduler]
    E --> F[Queue]
    F --> G[Workers]
    G --> H[Execution Result + State]
    H --> I[vigil Status Sync]
    I --> J[Incident / Action Timeline]

Integration contract intent

sequenceDiagram
    participant V as vigil
    participant C as chronos

    V->>C: POST /jobs (tenant, trace_id, idempotency_key)
    C-->>V: 202 Accepted (job_id, execution_id)
    V->>V: store correlation fields
    loop until terminal state
      V->>C: GET execution status
      C-->>V: state + attempts + timestamps
      V->>V: map chronos state -> incident/action state
    end
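The polling loop and state mapping in the sequence above can be sketched in Python as follows. The state names on both sides, the `get_status` callable, and the poll limits are hypothetical; the real contract (payloads, headers, errors) is still being finalized, so treat this as one possible shape rather than the agreed interface.

```python
import time
from typing import Callable, Dict

# Hypothetical chronos execution states mapped to hypothetical vigil action states.
STATE_MAP: Dict[str, str] = {
    "queued": "pending",
    "running": "in_progress",
    "retrying": "in_progress",
    "succeeded": "resolved",
    "failed": "failed",
    "dead_lettered": "needs_manual_intervention",
}
TERMINAL = {"succeeded", "failed", "dead_lettered"}


def map_state(chronos_state: str) -> str:
    """Deterministic mapping; unknown states raise instead of being guessed."""
    if chronos_state not in STATE_MAP:
        raise ValueError(f"unmapped chronos state: {chronos_state!r}")
    return STATE_MAP[chronos_state]


def sync_until_terminal(
    get_status: Callable[[], dict],
    on_update: Callable[[str], None],
    interval_s: float = 1.0,
    max_polls: int = 60,
) -> str:
    """Poll until a terminal state or max_polls; return the last chronos state seen."""
    state = "queued"
    for _ in range(max_polls):
        state = get_status()["state"]
        on_update(map_state(state))  # e.g. append to the incident/action timeline
        if state in TERMINAL:
            return state
        time.sleep(interval_s)
    return state
```

Raising on an unmapped state is a deliberate choice in this sketch: a silent default would hide contract drift between the two services.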

Core principles

Idempotency first: duplicate triggers must not cause the same remediation to run more than once
Tenant isolation: cross-tenant execution is rejected by design
Auditability: every decision and execution must be traceable
Fail-safe retries: retries are bounded; jobs that exhaust them go to a dead-letter path for manual intervention
Operational clarity: status mapping must be deterministic across both services
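The fail-safe retry principle can be illustrated with a small Python sketch: a task gets a fixed number of attempts, and once they are exhausted the job is recorded in a dead-letter list instead of being retried forever. The function name, the list-based dead-letter store, and the returned state strings are assumptions; backoff between attempts is omitted for brevity.

```python
from typing import Callable, List, Tuple


def run_with_retries(
    task: Callable[[], None],
    max_attempts: int,
    dead_letter: List[Tuple[str, str]],
    job_id: str,
) -> Tuple[str, int]:
    """Run task up to max_attempts times; on exhaustion, record it for manual review."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return "succeeded", attempt
        except Exception as exc:  # a real worker would likely separate retryable errors
            last_error = str(exc)
    # Attempts exhausted: park the job with its last error instead of looping forever.
    dead_letter.append((job_id, last_error))
    return "dead_lettered", max_attempts
```

Returning the attempt count alongside the outcome keeps enough detail for the audit trail that the auditability principle asks for.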

Current working model

Service boundaries

vigil remains the decision/control plane
chronos remains the execution engine
No monolith merge; strict API contract between services

Source of truth

chronos execution state is persisted durably
queue is transient transport
Redis (coordination) is advisory and used only as an optimization where applicable
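One consequence of treating the queue as transient transport is that it should be rebuildable from persisted execution state after a restart. The Python sketch below shows that idea with a plain dictionary standing in for the durable store; the state names and the `rebuild_queue` helper are hypothetical and not taken from chronos.

```python
from collections import deque
from typing import Deque, Dict

TERMINAL_STATES = {"succeeded", "failed", "dead_lettered"}  # hypothetical names


def rebuild_queue(store: Dict[str, str]) -> Deque[str]:
    """Re-enqueue every execution whose persisted state is not terminal."""
    queue: Deque[str] = deque()
    # Sorted iteration only makes the rebuild order reproducible in this sketch.
    for execution_id, state in sorted(store.items()):
        if state not in TERMINAL_STATES:
            queue.append(execution_id)
    return queue


store = {"e1": "succeeded", "e2": "running", "e3": "queued"}
pending = rebuild_queue(store)  # contains "e2" and "e3"; "e1" is already finished
```

A rebuild like this may re-deliver work that was in flight when the queue was lost, which is one reason the idempotency principle matters on the execution side as well.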

Short roadmap (next)

Near-term (now)

finalize integration contract v1 (payloads, headers, errors)
complete correlation persistence + status sync loop hardening
expand integration tests: duplicates, retries, failures, dead-letter, tenant isolation

Next step (soon)

operator-facing integration health panels
tighter SLO-driven alerts
production hardening pass on remaining MVP internals

Later

unified operator UX layer across vigil + chronos
broader remediation playbooks and policy templates

Repositories

chronos — scheduling, dispatch, execution engine
vigil — monitoring, policy, remediation control plane

If you are contributing, prioritize practical changes with testable outcomes and clear operator impact.
