## Summary
Static thresholds ("alert when CPU > 80%") generate alert fatigue because they don't account for normal workload patterns. CPU at 85% might be expected at 2pm Tuesday during month-end processing but alarming at 3am Sunday. Dynamic baselines learn what "normal" looks like for each time window and flag deviations from that pattern.
PerformanceMonitor's `compare_analysis` already does a primitive version of this (compare current 4 hours vs same window 28 hours ago). This issue tracks making baselines time-of-day and day-of-week aware.
## Core Concept
- Collect 30+ days of historical metrics
- Build per-metric baselines segmented by time-of-day and day-of-week (e.g., "Tuesday 2pm CPU is typically 70-85%")
- Compute confidence bands (e.g., mean ± 2-3 standard deviations)
- Flag current values that fall outside the expected band for this specific time window
- Continuously update baselines as workloads evolve
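The steps above can be sketched in a few lines of Python. Names like `build_baselines` and `is_anomalous` are illustrative only, not existing project code:

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baselines(samples):
    """samples: iterable of (weekday, hour, value) tuples drawn from 30+ days
    of history. Returns {(weekday, hour): (mean, stdev)} per time bucket."""
    buckets = defaultdict(list)
    for weekday, hour, value in samples:
        buckets[(weekday, hour)].append(value)
    # stdev needs at least two samples; thin buckets are simply omitted
    return {k: (mean(v), stdev(v)) for k, v in buckets.items() if len(v) >= 2}

def is_anomalous(baselines, weekday, hour, value, k=3.0):
    """Flag values outside mean +/- k standard deviations for this bucket."""
    if (weekday, hour) not in baselines:
        return False  # no history for this bucket: degrade gracefully
    mu, sigma = baselines[(weekday, hour)]
    return abs(value - mu) > k * sigma
```

The continuous-update step would simply re-run `build_baselines` over a rolling window (or fold new samples into running aggregates) so the bands track workload drift.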
## Which Metrics to Baseline
### High value (clear daily/weekly patterns)
- CPU utilization
- Batch Requests/sec
- Wait stats (total wait time per type)
- Session/connection counts
- Query duration aggregates
### Medium value
- Memory utilization (tends to be more stable)
- I/O latency
- TempDB usage
- Blocking event counts
## Where This Applies
### Analysis Engine (both Dashboard and Lite)
The inference engine's fact scoring could incorporate baseline deviation as an amplifier. A CPU reading of 85% with a baseline of 80±5% scores low (normal). The same 85% with a baseline of 40±10% scores high (anomalous). This makes the engine's findings context-aware without changing the rule structure.
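One way the amplifier could work is as a capped z-score multiplier. This is a hypothetical sketch, not the engine's actual scoring code:

```python
def deviation_amplifier(value, baseline_mean, baseline_std, cap=3.0):
    """Scale a fact's score by how far the value sits from its time-bucket
    baseline. Returns 1.0 (no amplification) within one standard deviation,
    grows with the z-score, and is capped so one wild reading can't dominate."""
    if baseline_std <= 0:
        return 1.0  # degenerate baseline: leave the score untouched
    z = abs(value - baseline_mean) / baseline_std
    return min(max(z, 1.0), cap)

# 85% CPU against an 80 +/- 5 baseline: z = 1.0, no amplification
# 85% CPU against a 40 +/- 10 baseline: z = 4.5, capped at 3.0
```

Because the amplifier is a multiplier on the existing score rather than a new rule, the rule structure stays untouched, as the section above suggests.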
### Alert Thresholds
Instead of fixed thresholds, alerts could fire on "deviation from baseline exceeds N standard deviations." This directly addresses alert fatigue — the #1 cited barrier to faster incident response (per a 2024 industry survey).
### Trend Charts (both Dashboard and Lite)
Overlay a shaded "expected range" band on metric charts. Visually, the user sees the metric line and a band showing what's normal. When the line exits the band, something changed. This is the visual equivalent of the annotation markers from issue #688 but for statistical context rather than discrete events.
### `compare_analysis` Enhancement
The existing `compare_analysis` MCP tool compares two time windows. With baselines, it could compare the current window against the expected baseline for this time of day/week rather than a fixed offset, making the comparison more meaningful.
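A sketch of how that lookup might differ from the fixed-offset comparison, assuming a `baselines` mapping of `(weekday, hour) -> (mean, stdev)` maintained elsewhere (function names are hypothetical):

```python
from datetime import datetime

def expected_band(baselines, when: datetime, k=2.0):
    """Return the (low, high) expected range for a timestamp's
    (weekday, hour) bucket, or None when that bucket has no history."""
    stats = baselines.get((when.weekday(), when.hour))
    if stats is None:
        return None
    mu, sigma = stats
    return (mu - k * sigma, mu + k * sigma)

def compare_to_baseline(baselines, when, current_mean, k=2.0):
    """Compare the current window against the learned band for this specific
    time of day and day of week, instead of 'the same window 28 hours ago'."""
    band = expected_band(baselines, when, k)
    if band is None:
        return "no-baseline"
    low, high = band
    if current_mean < low:
        return "below-expected"
    if current_mean > high:
        return "above-expected"
    return "within-expected"
```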
## Data Requirements
### Dashboard
Historical data is already in the `PerformanceMonitor` SQL Server database. Baseline computation could be a scheduled calculation (SQL Agent job or application-level) that maintains a baseline table with per-metric, per-hour-of-day, per-day-of-week statistics.
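The scheduled calculation might maintain a table shaped like the one below. This is a runnable stand-in using Python's bundled sqlite3 rather than SQL Server, and every table and column name is hypothetical, not the actual PerformanceMonitor schema:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metric_sample (metric TEXT, collected_at TEXT, value REAL);
    CREATE TABLE metric_baseline (
        metric TEXT, day_of_week INTEGER, hour_of_day INTEGER,
        mean_value REAL, variance_value REAL, sample_count INTEGER
    );
""")
conn.executemany(
    "INSERT INTO metric_sample VALUES (?, ?, ?)",
    [("cpu_pct", "2024-01-02 14:05:00", 70.0),   # Tuesdays, 2pm bucket
     ("cpu_pct", "2024-01-02 14:35:00", 80.0),
     ("cpu_pct", "2024-01-09 14:10:00", 75.0)],
)
# Aggregate raw samples into per-metric, per-hour-of-day, per-day-of-week
# statistics. SQLite lacks STDEV, so store the population variance
# (avg(x*x) - avg(x)^2) and take the square root at read time.
conn.execute("""
    INSERT INTO metric_baseline
    SELECT metric,
           CAST(strftime('%w', collected_at) AS INTEGER) AS day_of_week,
           CAST(strftime('%H', collected_at) AS INTEGER) AS hour_of_day,
           AVG(value),
           AVG(value * value) - AVG(value) * AVG(value),
           COUNT(*)
    FROM metric_sample
    GROUP BY metric, day_of_week, hour_of_day
""")
mean_value, variance_value, n = conn.execute(
    "SELECT mean_value, variance_value, sample_count FROM metric_baseline"
).fetchone()
```

On SQL Server the same shape would come from `DATEPART(WEEKDAY, ...)`/`DATEPART(HOUR, ...)` grouping with `AVG` and `STDEV`, run by the Agent job.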
### Lite
Historical data is in DuckDB/Parquet. Baseline computation could run as part of the collector cycle or on-demand. DuckDB's analytical query capabilities make time-bucketed aggregation efficient.
Both apps need at least 2-4 weeks of data before baselines become meaningful. New installations should gracefully degrade to static thresholds until sufficient history exists.
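The graceful-degradation rule could be as simple as the following sketch (the function name, tuple shape, and `min_samples` cutoff are all illustrative):

```python
def threshold_breached(value, baseline, static_limit, min_samples=14, k=3.0):
    """Use the learned band only when the bucket has enough history;
    otherwise fall back to the static threshold a new installation starts with.
    `baseline` is (mean, stdev, sample_count) or None when no history exists."""
    if baseline is None or baseline[2] < min_samples:
        return value > static_limit           # static-threshold fallback
    mu, sigma, _ = baseline
    return abs(value - mu) > k * sigma        # baseline-deviation check
```

As history accumulates past the cutoff, alerts switch automatically from the fixed limit to the per-bucket band, with no configuration change required.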
## Design Notes
- Start simple: mean and standard deviation per metric per hour-of-day per day-of-week
- More sophisticated approaches (seasonal decomposition, exponential smoothing) can come later
- The baseline computation itself is cheap — it's an aggregation over data that's already stored
- The UX challenge is communicating "this is unusual for this time" vs "this crossed a fixed threshold" — the shaded band on charts is the clearest way
- Applies to both Dashboard and Lite, plus MCP analysis tools