AutoWatch is a lightweight, self-healing infrastructure monitoring tool designed for SREs and System Administrators. It monitors system resources (CPU, Memory, Disk) and critical services, automatically attempting remediation and logging alerts when thresholds are breached.
autowatch/
├── bin/
│   ├── monitor.sh        # Main logic: checks metrics vs thresholds
│   └── remediate.sh      # Action scripts: cleans disk, restarts services
├── config/
│   ├── thresholds.conf   # Define limits for CPU, RAM, Disk
│   └── services.conf     # List of services to keep alive (nginx, ssh, etc.)
├── alerts/
│   └── notifier.py       # Python script to handle logging and notifications
├── cron/
│   └── autowatch.cron    # Cron job definition for continuous monitoring
├── logs/
│   ├── metrics.log       # Time-series data of system health
│   └── alerts.log        # History of incidents and remediation actions
├── runbooks/             # Documentation for manual incident resolution
└── setup.sh              # One-click installation script
- Initialize the Environment: Run the setup script to create the necessary directories and set permissions.
    ./setup.sh
- Configure Thresholds: Edit config/thresholds.conf to set your desired limits.
    CPU_LIMIT=80
    MEM_LIMIT=75
    DISK_LIMIT=85
- Define Critical Services: Add service names (as recognized by systemctl) to config/services.conf.
    nginx
    docker
    cron
- Run Manually: Test the monitoring script.
    ./bin/monitor.sh
- Automate with Cron: Install the cron job so the monitor runs every 2 minutes. Note that crontab <file> replaces the current user's existing crontab.
    crontab cron/autowatch.cron
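To illustrate the KEY=VALUE format shown for config/thresholds.conf, here is a minimal Python sketch of how such a file could be read. The `parse_thresholds` helper is hypothetical and not part of AutoWatch. It assumes one KEY=VALUE pair per line and treats blank lines and `#` comments as ignorable.

```python
def parse_thresholds(text: str) -> dict:
    """Parse KEY=VALUE lines into a dict of numeric limits (hypothetical helper)."""
    limits = {}
    for raw in text.splitlines():
        # Assumption: '#' starts a comment; strip it and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got: {raw!r}")
        limits[key.strip()] = float(value)
    return limits


sample = "CPU_LIMIT=80\nMEM_LIMIT=75\nDISK_LIMIT=85\n"
print(parse_thresholds(sample))
# → {'CPU_LIMIT': 80.0, 'MEM_LIMIT': 75.0, 'DISK_LIMIT': 85.0}
```

Converting every limit to a float up front keeps later comparisons against fractional readings such as 12.5% straightforward.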
- Monitor: monitor.sh gathers current system stats.
- Evaluate: Uses bc for precise floating-point comparison against config.
- Alert: If a threshold is breached, notifier.py logs the incident to logs/alerts.log.
- Remediate:
  - Disk Full: Triggers remediate.sh disk to clean /tmp and vacuum logs.
  - Service Down: Triggers remediate.sh service <name> to restart the failed service.
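The evaluate, alert, and remediate steps can be sketched as a single decision function. This is an illustration in Python, not the actual monitor.sh logic, which the workflow says uses bc in shell. The `plan_actions` name, the strict greater-than comparison, and the alert wording for anything other than CPU are assumptions. The remediation commands mirror the `remediate.sh disk` and `remediate.sh service <name>` invocations described in the workflow.

```python
def plan_actions(metrics, limits, failed_services):
    """Return (alerts, commands) for one monitoring pass (hypothetical helper)."""
    alerts, commands = [], []
    # Each entry: metric name, matching limit key, label used in the alert text.
    checks = [
        ("cpu", "CPU_LIMIT", "CPU"),
        ("mem", "MEM_LIMIT", "Memory"),
        ("disk", "DISK_LIMIT", "Disk"),
    ]
    for metric, limit_key, label in checks:
        # Assumption: a threshold counts as breached only when strictly exceeded.
        if metrics[metric] > limits[limit_key]:
            alerts.append(f"{label} usage critical: {metrics[metric]:g}%")
            if metric == "disk":
                # Only the disk check has an automatic remediation in the workflow.
                commands.append(["./bin/remediate.sh", "disk"])
    for name in failed_services:
        alerts.append(f"Service down: {name}")
        commands.append(["./bin/remediate.sh", "service", name])
    return alerts, commands


limits = {"CPU_LIMIT": 80.0, "MEM_LIMIT": 75.0, "DISK_LIMIT": 85.0}
metrics = {"cpu": 92.0, "mem": 45.2, "disk": 60.0}
alerts, commands = plan_actions(metrics, limits, ["nginx"])
print(alerts)    # → ['CPU usage critical: 92%', 'Service down: nginx']
print(commands)  # → [['./bin/remediate.sh', 'service', 'nginx']]
```

Returning the commands as argument lists rather than running them keeps the decision logic easy to test. A caller could pass each list to `subprocess.run` if it wanted to execute them.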
- metrics.log:
    2025-12-23 20:00:00 cpu=12.5% mem=45.2% disk=60%
- alerts.log:
    2025-12-23 20:05:00 [ALERT] CPU usage critical: 92%
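Because metrics.log uses a regular `timestamp key=value%` layout, its lines are easy to parse for ad hoc analysis. The sketch below assumes every line follows the sample shown above. `parse_metrics_line` is a hypothetical helper, not something AutoWatch ships.

```python
from datetime import datetime


def parse_metrics_line(line):
    """Split one metrics.log line into (timestamp, {metric: percent})."""
    date, time, *fields = line.split()
    timestamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    values = {}
    for field in fields:
        name, _, value = field.partition("=")
        values[name] = float(value.rstrip("%"))  # "12.5%" -> 12.5
    return timestamp, values


ts, values = parse_metrics_line("2025-12-23 20:00:00 cpu=12.5% mem=45.2% disk=60%")
print(ts.isoformat(), values)
# → 2025-12-23T20:00:00 {'cpu': 12.5, 'mem': 45.2, 'disk': 60.0}
```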
Built for reliability.