Vigil Monitoring System

A backend-first, self-healing control plane that keeps services healthy by watching metrics, evaluating policies, and triggering automated remediation.

📌 Project Overview

Vigil is a lightweight monitoring and automation service. It receives metrics and events, decides whether something is wrong, and if needed, kicks off remediation tasks. Operators use Vigil to keep applications stable without watching dashboards all day. The project focuses on a clear, reliable backend that any frontend or automation pipeline can integrate with.

Why Vigil matters:

Reduces manual intervention by responding to incidents automatically.
Creates a shared source of truth for health signals, policies, and actions.
Fits easily into existing stacks thanks to a clean REST API and simple deployment.

✨ Core Features

| Area | What It Does |
| --- | --- |
| Metrics Ingestion | Accepts structured metrics or events from agents and services. |
| Policy Engine | Evaluates metrics against human-friendly rules to spot issues early. |
| Action Management | Tracks remediation actions with full history and status updates. |
| Queue & Worker | Uses a Redis-backed queue and async workers to process tasks safely. |
| Remediator Integration | Talks to remediator services to execute real changes. |
| Simulator | Generates synthetic load to test Vigil end-to-end. |
| Audit & Logging | Records requests, policy evaluations, and task outcomes as structured logs, and exposes Prometheus metrics for later review. |

🧭 Architecture at a Glance

flowchart LR
    Agents((Agents & Services)) -->|Metrics / Events| API[FastAPI Collector]
    API -->|Evaluate| Policy[Policy Engine]
    Policy -->|Violations| Queue[(Redis Queue)]
    Queue --> Worker[Async Worker]
    Worker --> Remediator[Go Remediator]
    Remediator -->|Results| ActionsDB[(Actions Store)]
    API -->|Audit Trail| Logs[(Structured Logs / Metrics)]
    API -->|Expose| Endpoints[/REST API/]

FastAPI Collector handles requests, validation, and routing.
Policy Engine decides whether metrics break thresholds.
Redis Queue + Worker ensure remediation tasks run in order and can be retried.
Remediator performs the actual fix (restart service, scale replicas, etc.).
Audit Trail keeps a record of what happened for observability.

🔄 Data Flow

sequenceDiagram
    participant Agent
    participant API
    participant Policy
    participant Queue
    participant Worker
    participant Remediator
    participant History

    Agent->>API: POST /api/v1/ingest (metric)
    API->>Policy: Evaluate metric
    alt Policy match
        Policy-->>API: Violation list
        API->>Queue: Enqueue remediation task
        Queue->>Worker: Deliver task (BLPOP)
        Worker->>Remediator: Execute action request
        Remediator-->>Worker: Result (success / failure)
        Worker->>History: Update action status
    else No match
        Policy-->>API: No violations
    end
    API-->>Agent: Response (stored, violations, actions)

Metrics arrive through the ingest endpoint.
Policies run immediately on ingest and again on a scheduled background runner.
Remediation tasks join the queue and workers process them.
Actions, statuses, and logs become available for dashboards or audits.
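
The flow above can be sketched in memory. This is a minimal illustration under stated assumptions, not Vigil's actual code: the `Policy` class, the `ingest` function, and the `deque` standing in for the Redis queue are all hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical threshold policy; field names are illustrative only."""
    name: str
    metric: str
    threshold: float
    action: str


def ingest(metric: dict, policies: list[Policy], queue: deque) -> dict:
    """Evaluate one metric and enqueue one task per violated policy."""
    violations = [
        p for p in policies
        if p.metric == metric["name"] and metric["value"] > p.threshold
    ]
    for p in violations:
        # A deque stands in for the Redis queue in this sketch.
        queue.append({
            "policy": p.name,
            "action": p.action,
            "tags": metric.get("tags", {}),
        })
    return {"status": "stored", "violations": [p.name for p in violations]}


queue: deque = deque()
policies = [Policy("high-cpu-alert", "cpu_usage", 90, "scale-up")]
result = ingest(
    {"name": "cpu_usage", "value": 92, "tags": {"host": "web-01"}},
    policies,
    queue,
)
print(result["violations"], len(queue))
```

A metric below the threshold returns an empty violation list and leaves the queue untouched, which mirrors the "No match" branch of the diagram.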

⚙️ Queue & Worker Lifecycle

stateDiagram-v2
    [*] --> Waiting
    Waiting --> Enqueuing: Policy violation detected
    Enqueuing --> Pending
    Pending --> Processing: Worker reserves task
    Processing --> Completed: Remediator success
    Processing --> Failed: Remediator error
    Failed --> Pending: Retry with backoff
    Completed --> Audited: Status stored + metrics updated
    Audited --> [*]

Pending tasks stay in Redis until a worker grabs them.
Processing includes calling the remediator and updating the action record.
Failed tasks record the error and return to Pending for a retry with backoff.
Completed tasks feed Prometheus counters, logs, and history endpoints.
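
The lifecycle can be sketched as a small retry loop. This is an assumption-laden illustration: the `TaskState` enum, the `backoff_delay` and `run_task` helpers, the attempt limit, and the exponential schedule are hypothetical, since the README does not specify how Vigil computes its backoff.

```python
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff (base * 2**attempt), capped. Values are assumptions."""
    return min(cap, base * (2 ** attempt))


def run_task(remediate, max_attempts: int = 3):
    """Drive one task through the lifecycle.

    Returns the list of states visited and the delays a real worker
    would wait before each retry.
    """
    history = [TaskState.PENDING]
    delays = []
    for attempt in range(max_attempts):
        history.append(TaskState.PROCESSING)
        try:
            remediate()
        except Exception:
            history.append(TaskState.FAILED)
            if attempt + 1 < max_attempts:
                # Failed -> Pending: schedule another attempt.
                delays.append(backoff_delay(attempt))
                history.append(TaskState.PENDING)
            continue
        history.append(TaskState.COMPLETED)
        break
    return history, delays
```

A real worker would sleep for each delay or re-enqueue the task with a not-before timestamp; the sketch only records the delays so the transitions stay easy to inspect.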

🚀 Getting Started (Backend Only)

Prerequisites

Python 3.12+
Redis (local or Docker)
Optional: PostgreSQL (SQLite works by default)

1. Clone the repository

git clone https://github.com/fixitdaddy/vigil.git
cd vigil

2. Set up the backend

cd python/app
python -m venv .venv
source .venv/bin/activate
pip install -r ../requirements.txt

3. Configure environment (optional)

cp ../.env.example ../.env
# adjust DATABASE_URL, REDIS_URL, CORS_ORIGINS as needed

4. Start Vigil API

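# app.main:app is resolved relative to the working directory; if the import fails from python/app, run this from python/ instead
uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload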

5. Verify health

curl http://localhost:8000/health

Expected:

{"status": "healthy", "service": "vigil"}

You now have the backend running. Connect any frontend (React, Vite, Google AI Studio, etc.) to http://localhost:8000.

🧪 Key API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| `/api/v1/ingest` | POST | Store metrics/events and trigger policy evaluation. |
| `/api/v1/actions` | GET | List actions with their status (pagination-ready). |
| `/api/v1/actions/{id}` | GET | Retrieve action details. |
| `/api/v1/actions/status/{status}` | GET | Filter actions by status (pending, running, completed, failed, cancelled). |
| `/api/v1/policies` | GET | List all policies and settings. |
| `/api/v1/policies` | POST | Create a new policy with a condition and an action. |
| `/api/v1/policies/{name}` | PUT | Update policy metadata and toggles. |
| `/api/v1/policies/{name}` | DELETE | Remove a policy. |
| `/api/v1/policies/evaluate` | POST | Test policies against sample metrics. |
| `/api/v1/policies/runner/status` | GET | Check background policy runner health. |
| `/api/v1/ui/queue/stats` | GET | (To be moved) Queue depth, processed counts, failure totals. |
| `/api/v1/ui/simulator/*` | POST/GET | (To be moved) Start/stop simulator and view status. |
| `/metrics` | GET | Prometheus metrics scrape endpoint. |
| `/health` | GET | Simple service liveness check. |

Example: Ingest Metric

curl -X POST http://localhost:8000/api/v1/ingest \
  -H "Content-Type: application/json" \
  -d '{
        "name": "cpu_usage",
        "value": 92.5,
        "tags": {"host": "web-01", "region": "us-east-1"}
      }'

Response:

{
  "ok": true,
  "metric_id": 12345,
  "status": "stored",
  "policies_evaluated": true,
  "violations": [
    {
      "policy_name": "high-cpu-alert",
      "severity": "warning",
      "action": "scale-up"
    }
  ]
}
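
The same call can be made from Python with only the standard library. This is a sketch rather than an official client: `build_ingest_request` and `send` are hypothetical helper names, and the payload fields are copied from the curl example.

```python
import json
import urllib.request
from typing import Optional


def build_ingest_request(
    base_url: str,
    name: str,
    value: float,
    tags: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/v1/ingest."""
    payload = {"name": name, "value": value, "tags": tags or {}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/ingest",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_ingest_request(
    "http://localhost:8000",
    "cpu_usage",
    92.5,
    {"host": "web-01", "region": "us-east-1"},
)
# send(req) performs the call; it needs a running Vigil instance.
```

Separating request construction from sending keeps the payload easy to inspect without touching the network.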

Example: Queue Stats

curl http://localhost:8000/api/v1/ui/queue/stats

Response:

{
  "queue_length": 2,
  "tasks_enqueued": 158,
  "tasks_dequeued": 156,
  "tasks_failed": 3,
  "tasks_completed": 153,
  "last_processed_task": {
    "task_id": "task_1706956800",
    "action_id": 456,
    "target": "web-service-01",
    "timestamp": "2026-01-19T12:00:00Z"
  }
}
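
A dashboard or script might derive a couple of health indicators from this payload. The sketch below assumes, based only on the sample response, that `tasks_enqueued - tasks_dequeued` equals `queue_length` and that dequeued tasks end up either completed or failed; Vigil's real counter semantics may differ, and `summarize_queue_stats` is a hypothetical helper.

```python
def summarize_queue_stats(stats: dict) -> dict:
    """Derive simple indicators from a queue stats payload."""
    in_flight = stats["tasks_enqueued"] - stats["tasks_dequeued"]
    finished = stats["tasks_completed"] + stats["tasks_failed"]
    failure_rate = stats["tasks_failed"] / finished if finished else 0.0
    return {
        # True when the counters agree with the reported queue depth.
        "backlog_matches": in_flight == stats["queue_length"],
        "failure_rate": failure_rate,
    }


sample = {
    "queue_length": 2,
    "tasks_enqueued": 158,
    "tasks_dequeued": 156,
    "tasks_failed": 3,
    "tasks_completed": 153,
}
summary = summarize_queue_stats(sample)
```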

Example: Policy Management

curl -X POST http://localhost:8000/api/v1/policies \
  -H "Content-Type: application/json" \
  -d '{
        "name": "high-cpu-alert",
        "description": "Scale up when CPU exceeds 90%",
        "severity": "warning",
        "target": "web-*",
        "enabled": true,
        "auto_remediate": true,
        "condition": {
          "type": "metric_exceeds",
          "metric": "cpu_usage",
          "threshold": 90
        },
        "action": "scale-up",
        "params": {"replicas": 2}
      }'
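
One plausible reading of how such a policy is matched against a metric is sketched below. Several points are assumptions, not documented behavior: that `target` is a glob applied to the metric's `host` tag, that `metric_exceeds` is a strict greater-than comparison, and the `violates` function name itself.

```python
from fnmatch import fnmatchcase


def violates(policy: dict, metric: dict) -> bool:
    """Return True if the metric breaches the policy's condition."""
    if not policy.get("enabled", True):
        return False
    condition = policy["condition"]
    if condition["type"] != "metric_exceeds":
        return False  # only the condition type shown above is handled
    host = metric.get("tags", {}).get("host", "")
    return (
        metric["name"] == condition["metric"]
        and fnmatchcase(host, policy.get("target", "*"))
        and metric["value"] > condition["threshold"]
    )


policy = {
    "name": "high-cpu-alert",
    "target": "web-*",
    "enabled": True,
    "condition": {"type": "metric_exceeds", "metric": "cpu", "threshold": 90},
}
# The metric reuses the policy's metric name so the two always agree.
metric = {
    "name": policy["condition"]["metric"],
    "value": 92,
    "tags": {"host": "web-01"},
}
print(violates(policy, metric))
```

`fnmatchcase` is used instead of `fnmatch` so that matching stays case-sensitive on every platform.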

🧑‍💻 Usage Walkthrough

1. Send Metrics: Agents post CPU, memory, HTTP response times, and custom signals to /api/v1/ingest.
2. Evaluate Policies: Policies flag abnormal values (e.g., CPU > 90%) and decide whether to remediate automatically.
3. Queue Tasks: Violations enqueue remediation tasks into Redis.
4. Worker Executes: Background worker dequeues tasks and calls the remediator service.
5. Remediator Acts: Remediator restarts services, scales deployments, or triggers custom handlers.
6. Review Actions: Operators check /api/v1/actions to see what happened and when.
7. Observe Metrics: Prometheus scrapes /metrics for dashboards and alerting.

📊 Observability & Monitoring

| Channel | What You Get | Tools |
| --- | --- | --- |
| Structured Logs | JSON logs for every request, policy evaluation, and task. | Log aggregation services, jq |
| Prometheus Metrics | Custom counters and gauges for ingest, queue depth, policy violations, and worker state. | Prometheus, Grafana |
| Queue Stats Endpoint | Live queue health, successes, failures, last task info. | Custom dashboards, frontend integration |
| Action History | Detailed status for each remediation, including timestamps and outcome. | REST clients, automation |

Suggested setup:

Scrape /metrics with Prometheus.
Forward logs into ELK / Loki / any preferred stack.
Build dashboards in Grafana or Google AI Studio using the REST API.
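
For services that feed the same log pipeline, JSON-per-line output can be produced with Python's standard `logging` module. The sketch below is illustrative only: the `JsonFormatter` class and the `policy_name` and `action_id` fields are hypothetical, since Vigil's actual log schema is not described here.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical context fields; pass them with `extra=` when logging.
    CONTEXT_KEYS = ("policy_name", "action_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("vigil.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.warning("policy violation", extra={"policy_name": "high-cpu-alert"})
line = stream.getvalue().strip()
```

Each emitted line is a standalone JSON object, so tools such as jq or Loki can filter on individual fields.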

🎬 Demo Story: "Scale Up the Web Tier"

1. Web service load increases and CPU hits 92%.
2. The agent posts cpu_usage=92 to Vigil.
3. Policy high-cpu-alert detects the threshold breach and enqueues a scale-up action.
4. A worker picks up the task and asks the Remediator to scale the deployment to 4 replicas.
5. The Remediator succeeds and reports back.
6. The Actions API shows status=completed with timestamps and notes.
7. Prometheus metric vigil_policy_violations_total{policy_name="high-cpu-alert"} increments.
8. The operator sees the successful action and lower CPU usage on their dashboard minutes later.
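
To confirm that the counter from the story moved, the text returned by /metrics can be read directly. The parser below is a rough sketch that handles only a simple subset of the Prometheus text format (no escaped quotes or braces inside label values); `parse_sample` and `violations_for` are hypothetical helpers, and the `low-disk` policy in the sample text is invented for illustration.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line: str):
    """Parse one sample line into (name, labels, value), or None."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def violations_for(exposition: str, policy_name: str) -> float:
    """Sum vigil_policy_violations_total samples for one policy."""
    total = 0.0
    for line in exposition.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        parsed = parse_sample(line)
        if parsed is None:
            continue
        name, labels, value = parsed
        if (name == "vigil_policy_violations_total"
                and labels.get("policy_name") == policy_name):
            total += value
    return total


text = """# TYPE vigil_policy_violations_total counter
vigil_policy_violations_total{policy_name="high-cpu-alert"} 3
vigil_policy_violations_total{policy_name="low-disk"} 1
"""
count = violations_for(text, "high-cpu-alert")
```

Comparing the value before and after the incident shows whether the violation was counted. For anything beyond a quick check, a maintained Prometheus client or a PromQL query is a better fit than hand-parsing.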

🛣️ Future Roadmap

| Area | Planned Enhancements |
| --- | --- |
| Remediation Plugins | Pluggable actions for Kubernetes, cloud APIs, serverless, and custom scripts. |
| External Hooks | Webhooks and message bus integrations for Slack, PagerDuty, Kafka, and event-driven workflows. |
| Scaling | Horizontal sharding of queue workers, multi-region Redis support, and policy evaluation batching. |
| Security | Authentication, RBAC roles, signed policy bundles, and TLS-by-default deployments. |
| Analytics | Richer reporting with anomaly detection, historical policy impact, and trend analysis. |

Credits

Core Engineering: @techmedaddy
