
logcat

fixitdaddy

Building an autonomous remediation platform — from signal to action, with clear auditability.

Our stack has two core services:

  • vigil → detects issues, evaluates policy, decides remediation
  • chronos → schedules and executes remediation jobs reliably

Flow in one line: Detect → Decide → Schedule → Execute → Verify


What we are building (technical story)

Modern systems already generate alerts. The real problem is what happens after alerting:

  • too much manual triage
  • delayed remediation
  • inconsistent execution
  • weak traceability between incident and action

We are building a control plane where:

  1. vigil observes + evaluates policy
  2. if remediation is needed, vigil triggers a tenant-aware, idempotent job in chronos
  3. chronos handles durable scheduling, retries, worker execution, and outcome state
  4. vigil syncs execution status back into incident/action timelines for operators
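Step 2 depends on vigil sending the same idempotency key every time it re-triggers the same remediation. The sketch below shows one way such a key could be derived. The `RemediationRequest` fields, the `action` payload field, and the choice of SHA-256 over tenant, incident, and action are assumptions for illustration, not the actual vigil contract.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RemediationRequest:
    # Hypothetical shape of a remediation decision produced by vigil.
    tenant: str
    incident_id: str
    action: str
    trace_id: str


def idempotency_key(req: RemediationRequest) -> str:
    """Derive a stable key from the fields that define "the same remediation".

    trace_id is deliberately excluded (an assumption): a re-evaluation of the
    same incident may carry a new trace but should not create a second job.
    """
    material = json.dumps(
        {"tenant": req.tenant, "incident_id": req.incident_id, "action": req.action},
        sort_keys=True,
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def build_job_payload(req: RemediationRequest) -> dict:
    # Field names follow the sequence diagram (tenant, trace_id,
    # idempotency_key); "action" is an illustrative addition.
    return {
        "tenant": req.tenant,
        "trace_id": req.trace_id,
        "idempotency_key": idempotency_key(req),
        "action": req.action,
    }
```

Because the tenant is part of the hashed material, two tenants that raise the same incident and action get different keys, which keeps deduplication tenant-scoped.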

Result: fewer manual loops, lower MTTR, and better operational confidence.


Current status

chronos

  • ✅ Core scheduler/worker architecture in place
  • ✅ API + execution state model available
  • ✅ Retry/dead-letter/recovery paths implemented (MVP/hardening level)
  • ✅ Observability + deployment scaffolding (compose/k8s baseline)
  • ⚠️ Some components are still being hardened from MVP to production-grade internals

vigil ↔ chronos integration

  • ✅ Direction and integration strategy locked
  • ✅ Initial integration implementation work underway
  • 🚧 Correlation + sync hardening in progress
  • 🚧 Full end-to-end contract validation and deeper integration tests are in progress

System design (high level)

flowchart LR
    A[Signals / Metrics / Events] --> B[vigil: Policy Engine]
    B -->|Remediation decision| C[vigil: Remediation Client]
    C -->|POST job + idempotency key| D[chronos API]
    D --> E[Scheduler]
    E --> F[Queue]
    F --> G[Workers]
    G --> H[Execution Result + State]
    H --> I[vigil Status Sync]
    I --> J[Incident / Action Timeline]

Integration contract intent

sequenceDiagram
    participant V as vigil
    participant C as chronos

    V->>C: POST /jobs (tenant, trace_id, idempotency_key)
    C-->>V: 202 Accepted (job_id, execution_id)
    V->>V: store correlation fields
    loop until terminal state
      V->>C: GET execution status
      C-->>V: state + attempts + timestamps
      V->>V: map chronos state -> incident/action state
    end
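The status sync loop in the sequence above could look roughly like the following on the vigil side. This is a minimal sketch: the chronos state names, the vigil action states, and the `fetch_status` callable (standing in for the GET execution status call) are all hypothetical, since contract v1 is not finalized.

```python
import time
from typing import Callable

# Hypothetical chronos execution states mapped to hypothetical vigil action states.
STATE_MAP = {
    "PENDING": "remediation_queued",
    "RUNNING": "remediation_in_progress",
    "RETRYING": "remediation_in_progress",
    "SUCCEEDED": "remediated",
    "FAILED": "remediation_failed",
    "DEAD_LETTER": "needs_manual_intervention",
}
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "DEAD_LETTER"}


def map_state(chronos_state: str) -> str:
    # Unknown states are surfaced explicitly instead of being guessed.
    return STATE_MAP.get(chronos_state, "unknown")


def sync_until_terminal(
    fetch_status: Callable[[str], dict],
    execution_id: str,
    on_update: Callable[[str, dict], None],
    max_polls: int = 50,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll chronos until a terminal state, recording each mapped state."""
    for _ in range(max_polls):
        status = fetch_status(execution_id)
        mapped = map_state(status["state"])
        on_update(mapped, status)  # e.g. append to the incident/action timeline
        if status["state"] in TERMINAL_STATES:
            return mapped
        sleep(interval_s)
    raise TimeoutError(f"execution {execution_id} not terminal after {max_polls} polls")
```

Keeping the mapping in a single table is one way to keep the status translation deterministic; bounding the number of polls avoids a sync loop that never ends when chronos is unreachable or an execution is stuck.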

Core principles

  • Idempotency first: duplicate triggers must not create duplicate effective remediation
  • Tenant isolation: cross-tenant execution is rejected by design
  • Auditability: every decision and execution must be traceable
  • Fail-safe retries: bounded retries + dead-letter for manual intervention paths
  • Operational clarity: status mapping must be deterministic across both services
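To make the first, second, and fourth principles concrete, here is a small in-memory model of the behaviour they describe. It is not the chronos implementation (chronos is a C++ service with durable state); the class, method, and state names are invented for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Callable


class TenantMismatch(Exception):
    """Raised when a tenant asks for a job that belongs to another tenant."""


@dataclass
class Job:
    job_id: str
    tenant: str
    idempotency_key: str
    state: str = "PENDING"
    attempts: int = 0


class InMemoryJobStore:
    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self._by_key: dict[tuple[str, str], Job] = {}
        self._by_id: dict[str, Job] = {}

    def submit(self, tenant: str, idempotency_key: str) -> Job:
        # Idempotency: a repeated (tenant, key) returns the existing job.
        key = (tenant, idempotency_key)
        if key in self._by_key:
            return self._by_key[key]
        job = Job(str(uuid.uuid4()), tenant, idempotency_key)
        self._by_key[key] = job
        self._by_id[job.job_id] = job
        return job

    def get(self, tenant: str, job_id: str) -> Job:
        # Tenant isolation: reject reads across tenant boundaries.
        job = self._by_id[job_id]
        if job.tenant != tenant:
            raise TenantMismatch(job_id)
        return job

    def execute(self, job: Job, run: Callable[[], None]) -> Job:
        # Bounded retries: after max_attempts failures, park in dead-letter.
        if job.state in {"SUCCEEDED", "DEAD_LETTER"}:
            return job
        while job.attempts < self.max_attempts:
            job.attempts += 1
            job.state = "RUNNING"
            try:
                run()
            except Exception:
                job.state = "RETRYING"
                continue
            job.state = "SUCCEEDED"
            return job
        job.state = "DEAD_LETTER"
        return job
```

Scoping the dedup key by tenant means one tenant's key can never collide with, or reveal, another tenant's job. A real engine would also add backoff between attempts and persist every state transition for auditability.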

Current working model

Service boundaries

  • vigil remains the decision/control plane
  • chronos remains the execution engine
  • No monolith merge; strict API contract between services

Source of truth

  • chronos execution state is persisted durably and is the source of truth
  • the queue is transient transport, not a record of state
  • Redis/coordination data is advisory (an optimization where applicable) and is never authoritative

Short roadmap (next)

Near-term (now)

  • finalize integration contract v1 (payloads, headers, errors)
  • complete correlation persistence + status sync loop hardening
  • expand integration tests: duplicates, retries, failures, dead-letter, tenant isolation

Next step (soon)

  • operator-facing integration health panels
  • tighter SLO-driven alerts
  • production hardening pass on remaining MVP internals

Later

  • unified operator UX layer across vigil + chronos
  • broader remediation playbooks and policy templates

Repositories

  • chronos — scheduling, dispatch, execution engine
  • vigil — monitoring, policy, remediation control plane

If you are contributing, prioritize practical changes with testable outcomes and clear operator impact.

Popular repositories

  1. vigil (Public, Python): Real-time policy monitoring, drift detection, and automated remediation for cloud-native infrastructure.
  2. chronos (Public, C++): Distributed C++ job scheduler and worker execution backend with retries, failover, observability, and docker-compose deployment.
  3. .github (Public)
