Self-hosted observability stack for the Asymptora Platform Engineering Squad. Runs on bare-metal Ubuntu Server via Docker Compose. Focus: learning to monitor, diagnose, and respond to incidents on Linux servers.
Context: this project runs on the lab server
asymptora-prod-01. No production services are hosted here — the goal is strictly educational. Once real projects start, the server will be wiped and this same repository will re-provision the stack from scratch.
| Component | Version | Role |
|---|---|---|
| Prometheus | 2.55 | Collects, stores, and evaluates metrics and alert rules |
| node_exporter | 1.8 | Exports host metrics (CPU, RAM, disk, network, systemd, processes, SSH) |
| Alertmanager | 0.27 | Deduplicates, groups, and routes alerts to Discord and ntfy |
| Grafana | 11.3 | Dashboard visualization |
| ntfy | 2.11 | Self-hosted push notifications for critical alerts on mobile |
Two Discord channels with different purposes:
| Channel | Purpose | Trigger |
|---|---|---|
#-infra-status |
Periodic health report — "is the host alive?" | Heartbeat every 6 hours |
#-incidentes |
Actionable alerts — "someone must act now" | Thresholds crossed |
This separation prevents alert fatigue: the status channel never gets incident noise, and the incidents channel never gets heartbeat noise.
- heartbeat →
#-infra-statusonly (every 6 hours, regardless of state) - info / warning →
#-incidentes - critical →
#-incidentesand ntfy (urgent push that bypasses Do Not Disturb) - Inhibition rules: if a host is down, warning alerts for the same host are silenced
| Category | Metric | Alert |
|---|---|---|
| Heartbeat | vector(1) |
InfraStatusHeartbeat |
| Availability | up, node_boot_time_seconds |
HostDown, HostRebooted |
| CPU saturation | node_load1, iowait |
HighLoadAverage, CriticalLoadAverage, HighIOWait |
| CPU utilization | node_cpu_seconds_total |
HighCPUUsage |
| Memory | node_memory_MemAvailable_bytes, SwapFree |
HighMemoryUsage, CriticalMemoryUsage, SwapUsageHigh |
| Disk | node_filesystem_avail_bytes, files_free |
DiskSpaceWarning, DiskSpaceCritical, InodesRunningOut |
| Integrity | node_timex_offset_seconds |
ClockSkewDetected |
| systemd | node_systemd_unit_state |
SystemdServiceFailed, SSHServiceDown |
| SSH | ssh_active_sessions, ssh_failed_logins_total, ssh_accepted_logins_total |
SSHSessionOpened, SSHFailedLoginsBurst |
SSH metrics are produced by scripts/ssh-metrics.sh, which writes into the node_exporter textfile collector every 30 seconds.
All commands below run on asymptora-prod-01 via SSH/Tailscale, as a user in the devops group.
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl git ca-certificates gnupg lsb-releasesudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker "$USER"
newgrp docker
docker version && docker compose versionOnly SSH, Grafana, and ntfy ports are open — and only over Tailscale.
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow in on tailscale0 to any port 22 proto tcp
sudo ufw allow in on tailscale0 to any port 3000 proto tcp
sudo ufw allow in on tailscale0 to any port 2586 proto tcp
sudo ufw enable
sudo ufw status verboseThe #-infra-status and #-incidentes channels each need their own webhook:
Webhook 1 — #-infra-status (periodic status every 6 hours)
- Channel settings of
#-infra-status→ Integrations → Webhooks → New Webhook. - Name:
Heartbeat. - Copy Webhook URL → save as
DISCORD_WEBHOOK_STATUS.
Webhook 2 — #-incidentes (real alerts)
- Channel settings of
#-incidentes→ Integrations → Webhooks → New Webhook. - Name:
Alertmanager. - Copy Webhook URL → save as
DISCORD_WEBHOOK_INCIDENTS.
Both URLs go into .env in step 4.
ntfy has no user/password for publishing by default: the topic name is the secret. It must be long and random.
openssl rand -hex 16 | sed 's/^/asymptora-/'
# example output: asymptora-7e3f1a9c8d5b4e2f0a6c9d8e7b1f2a3cSave the output — it becomes NTFY_TOPIC in .env.
On the phone:
- Install the ntfy app (Play Store / App Store).
- Open →
+icon → Subscribe to topic. - Change the default server to
http://asymptora-prod-01:2586(requires Tailscale). - Paste the generated topic.
- Enable notifications with bypass Do Not Disturb (Android) / Critical (iOS).
sudo mkdir -p /opt/asymptora
sudo chown "$USER:$USER" /opt/asymptora
cd /opt/asymptora
git clone https://github.com/asymptora/observability-stack.git
cd observability-stackcp .env.example .env
chmod 600 .env
nano .envFill in the 5 fields: GRAFANA_ADMIN_PASSWORD, DISCORD_WEBHOOK_STATUS, DISCORD_WEBHOOK_INCIDENTS, NTFY_TOPIC (from step 2), HOST_LABEL. Save with Ctrl+O, Enter, Ctrl+X.
chmod +x scripts/fetch-dashboards.sh
./scripts/fetch-dashboards.sh
ls -lh grafana/dashboards/Three .json files should appear (Node Exporter Full, Prometheus Stats, Alertmanager).
sudo mkdir -p /var/lib/node_exporter/textfile
sudo cp scripts/ssh-metrics.sh /usr/local/bin/ssh-metrics.sh
sudo chmod +x /usr/local/bin/ssh-metrics.sh
sudo cp systemd/ssh-metrics.service /etc/systemd/system/
sudo cp systemd/ssh-metrics.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now ssh-metrics.timer
# Validate
sudo systemctl status ssh-metrics.timer
sudo /usr/local/bin/ssh-metrics.sh
cat /var/lib/node_exporter/textfile/ssh.promThe cat output should show three metrics: ssh_active_sessions, ssh_failed_logins_total, ssh_accepted_logins_total.
docker run --rm -v "$PWD/prometheus":/etc/prometheus prom/prometheus:v2.55.1 \
promtool check config /etc/prometheus/prometheus.yml
docker run --rm -v "$PWD/prometheus/rules":/rules prom/prometheus:v2.55.1 \
promtool check rules /rules/alerts.ymlBoth must end with SUCCESS. If anything fails, fix it before continuing — the stack will not start with invalid YAML.
docker compose up -d
docker compose psAll 5 containers must show Status: Up. If any is Restarting, check the logs:
docker compose logs -f alertmanager
docker compose logs -f prometheus# Prometheus healthy
curl -s http://localhost:9090/-/healthy
# Expected: Prometheus Server is Healthy.
# Alertmanager healthy
curl -s http://localhost:9093/-/healthy
# Expected: OK
# node_exporter exposing metrics (network_mode: host)
curl -s http://localhost:9100/metrics | head -20
# SSH metrics reaching Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=ssh_active_sessions' | jq
# Prometheus targets (all must be UP)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health}'From your notebook (on Tailscale):
http://asymptora-prod-01:3000
Login: admin / the password set in .env. Navigate to Dashboards → Asymptora → Node Exporter Full. You should see live CPU, RAM, disk, and network metrics.
The first heartbeat fires on the next evaluation cycle (~30s after the stack is up). Within a few minutes, a green message should appear in #-infra-status. After that, one message every 6 hours.
curl -XPOST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[
{
"labels": {
"alertname": "ManualTest",
"severity": "critical",
"instance": "asymptora-prod-01",
"category": "test"
},
"annotations": {
"summary": "Manual test alert",
"description": "Validating routing to #-incidentes + ntfy."
}
}
]'Should appear in #-incidentes and push to the phone within ~30 seconds.
In another terminal, open a new SSH session to the server. Within 1 minute, #-incidentes receives the SSHSessionOpened alert.
ssh nonexistent_user@asymptora-prod-01 # 6 timesAfter 5 failures in 5 minutes, SSHFailedLoginsBurst fires in #-incidentes.
When an alert lands, this is the minimum sequence to execute before taking any action:
- Read the whole alert (host, category, severity, description).
- Open Grafana → host dashboard → last 1h.
- SSH into the host and run the first 60 seconds sequence:
uptime ; dmesg -T | tail -20 ; vmstat 1 5 ; mpstat -P ALL 1 3
iostat -xz 1 3 ; free -h ; sar -n DEV 1 3 ; ss -s ; top -bn1 | head -20 ; df -h- Correlate Grafana ↔ terminal: does the dashboard agree with
vmstat/iostat? - Document everything in a post-mortem (even for false alarms).
# After editing prometheus.yml or alerts.yml
curl -X POST http://localhost:9090/-/reload
# After editing alertmanager.tmpl.yml
docker compose restart alertmanagercurl -s http://localhost:9093/api/v2/alerts | jqdocker compose down
sudo tar czf /var/backups/observability-$(date +%F).tar.gz \
/var/lib/docker/volumes/observability_prometheus_data \
/var/lib/docker/volumes/observability_grafana_data
docker compose up -ddocker compose pull
docker compose up -dobservability-stack/
├── .env.example
├── .gitignore
├── docker-compose.yml
├── README.md
├── alertmanager/
│ ├── alertmanager.tmpl.yml # template (envsubst expands ${VAR})
│ └── entrypoint.sh # runs envsubst and starts alertmanager
├── grafana/
│ ├── dashboards/ # JSONs downloaded by fetch-dashboards.sh
│ └── provisioning/
│ ├── dashboards/dashboards.yml
│ └── datasources/prometheus.yml
├── ntfy/
│ └── server.yml
├── prometheus/
│ ├── prometheus.yml
│ └── rules/
│ └── alerts.yml # 18 alert rules (17 incidents + 1 heartbeat)
├── scripts/
│ ├── fetch-dashboards.sh # downloads dashboards from grafana.com
│ └── ssh-metrics.sh # textfile collector for SSH
└── systemd/
├── ssh-metrics.service
└── ssh-metrics.timer
Why network_mode: host on node_exporter. It is the only way to collect real host network metrics — in bridge mode, it would only see the container's virtual interface.
Why envsubst on Alertmanager. Alertmanager does not expand ${VAR} in every YAML field. The custom entrypoint renders the template before starting the process, ensuring DISCORD_WEBHOOK_STATUS, DISCORD_WEBHOOK_INCIDENTS, and NTFY_TOPIC are injected safely from the environment.
Why two Discord channels. #-infra-status answers "is it alive?", #-incidentes answers "do I need to act?". Mixing them trains people to ignore notifications and causes real alerts to be missed. This is a well-established anti-alert-fatigue pattern.
Why ntfy and Discord for critical. Discord is asynchronous and muted outside working hours. ntfy bypasses Do Not Disturb and ensures someone wakes up when asymptora-prod-01 goes down at 3 AM.
Why Prometheus on 127.0.0.1:9090. Prometheus has no native authentication. Binding to loopback forces access over Tailscale + SSH tunnel (ssh -L 9090:localhost:9090), keeping the UI private.
What is out of scope for this module. Loki/Promtail (centralized logs) and cAdvisor (container metrics) — these come in M8 once the real stack starts running services.
- Add Loki + Promtail for centralized logs and LogQL-based alerts
- Add blackbox_exporter to probe external endpoints
- Add cAdvisor once real containers land on the server
- Define SLIs/SLOs based on the historical data collected here
- Document real incidents in
incident-log/as post-mortems - Replace the
vector(1)heartbeat with a Python job that posts a real metrics summary (uptime, load, disk, memory) to#-infra-statusvia the Prometheus HTTP API — natural project for M1 (Python for DevOps)