An intelligent SQL Server backup monitoring solution that uses machine learning anomaly detection and predictive analytics to identify performance issues, SLA violations, and potential backup failures.
- Anomaly Detection: Uses Isolation Forest ML model to identify unusual backup patterns
- Performance Analysis: Monitors I/O throughput, duration, compression efficiency, and growth rates
- SLA Monitoring: Tracks backup duration against configurable SLA thresholds
- Failure Prediction: Predicts potential backup failures based on size growth and I/O metrics
- Root Cause Analysis: Provides intelligent hints about the underlying causes of anomalies
- LSN Chain Validation: Detects broken transaction log backup chains
- Alert Prioritization: Automatically assigns severity levels (CRITICAL, HIGH, MEDIUM, LOW, INFO)
- Multi-Database Support: Monitors multiple databases simultaneously with per-database analysis
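The LSN chain validation listed above can be illustrated with a small, self-contained sketch. It assumes the usual SQL Server rule that each transaction log backup's FirstLSN equals the LastLSN of the previous log backup for the same database. The `LogBackup` type and `find_chain_breaks` function are hypothetical names for illustration, not taken from the script.

```python
from typing import List, NamedTuple


class LogBackup(NamedTuple):
    """Minimal stand-in for one LOG row of dbo.BackupMonitoringHistory."""
    backup_id: int
    first_lsn: int
    last_lsn: int


def find_chain_breaks(log_backups: List[LogBackup]) -> List[int]:
    """Return the BackupIDs whose FirstLSN does not match the previous LastLSN.

    Assumes the list holds LOG backups of a single database, ordered by
    start time.
    """
    breaks = []
    for previous, current in zip(log_backups, log_backups[1:]):
        if current.first_lsn != previous.last_lsn:
            breaks.append(current.backup_id)
    return breaks


backups = [
    LogBackup(1, first_lsn=100, last_lsn=200),
    LogBackup(2, first_lsn=200, last_lsn=300),
    LogBackup(3, first_lsn=350, last_lsn=400),  # gap: 300 -> 350
]
print(find_chain_breaks(backups))  # [3]
```

A non-empty result corresponds to the CRITICAL condition, since the recovery chain cannot be replayed across the gap.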
- Backup Duration: Time taken for backup operations (in seconds)
- Backup Size: Uncompressed and compressed backup sizes (in MB)
- I/O Throughput: Backup speed measured in MB/s
- Compression Ratio: Effectiveness of backup compression
- Database Growth Rate: Change in backup size between runs
- Rolling Averages: 7-day moving averages for duration and throughput
- Deviation Scores: Percentage deviation from rolling baselines
- Seconds Per GB: Normalized performance metric (duration per GB of data)
- Compression Efficiency Changes: Deviations in compression behavior
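As a rough illustration of how the metrics above could be derived with pandas, the sketch below uses the column names of `dbo.BackupMonitoringHistory` on made-up sample rows. The derived column names and the definition of the compression ratio (uncompressed size divided by compressed size) are assumptions; the actual script may compute them differently.

```python
import pandas as pd

# Hypothetical sample rows: the fourth backup takes twice as long as the others.
df = pd.DataFrame({
    "DatabaseName": ["MyDB"] * 4,
    "DurationSeconds": [100, 100, 100, 200],
    "BackupSizeMB": [1024, 1024, 1024, 1024],
    "CompressedBackupSizeMB": [512, 512, 512, 512],
})

# Throughput in MB/s and duration normalized per GB of data.
df["IO_Throughput_MBPS"] = df["BackupSizeMB"] / df["DurationSeconds"]
df["SecondsPerGB"] = df["DurationSeconds"] / (df["BackupSizeMB"] / 1024)

# Compression ratio (assumed definition: uncompressed / compressed).
df["CompressionRatio"] = df["BackupSizeMB"] / df["CompressedBackupSizeMB"]

# Relative size change versus the previous backup of the same database.
previous_size = df.groupby("DatabaseName")["BackupSizeMB"].shift()
df["GrowthRate"] = (df["BackupSizeMB"] - previous_size) / previous_size

# Rolling baseline over up to 7 backups, computed per database.
df["RollingThroughput"] = (
    df.groupby("DatabaseName")["IO_Throughput_MBPS"]
    .transform(lambda x: x.rolling(7, min_periods=1).mean())
)

# Percentage deviation of each backup from its rolling baseline.
df["ThroughputDeviation"] = (
    (df["IO_Throughput_MBPS"] - df["RollingThroughput"])
    / df["RollingThroughput"] * 100
)

print(df[["IO_Throughput_MBPS", "SecondsPerGB", "ThroughputDeviation"]].round(2))
```

Note that the rolling window here includes the current backup, so a single slow run also pulls its own baseline down slightly.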
| Priority | Condition | Action |
|---|---|---|
| CRITICAL | Broken LSN chain in transaction logs | Investigate immediately - recovery chain is broken |
| HIGH | SLA violation (duration exceeds threshold) or predicted failure risk | Address performance or scaling issues |
| MEDIUM | I/O bottleneck detected | Monitor storage performance |
| LOW | ML-detected anomalies without other risk factors | Review for potential issues |
| INFO | Normal operation | No action needed |
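The table above implies a fixed order of precedence. A minimal sketch of that logic follows; the function name and boolean flags are hypothetical, and the script's own implementation may be structured differently.

```python
def assign_priority(lsn_chain_broken, sla_violation, predicted_failure_risk,
                    io_bottleneck, ml_anomaly):
    """Map risk flags to a severity level, checking the most severe first."""
    if lsn_chain_broken:
        return "CRITICAL"
    if sla_violation or predicted_failure_risk:
        return "HIGH"
    if io_bottleneck:
        return "MEDIUM"
    if ml_anomaly:
        return "LOW"
    return "INFO"


# A backup that breaches the SLA and also shows an I/O bottleneck is HIGH,
# because the higher-severity condition wins.
print(assign_priority(False, True, False, True, True))  # HIGH
```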
- Python 3.7+
- SQL Server (2016 or later)
- ODBC Driver 17 for SQL Server
pyodbc>=4.0.0
pandas>=1.0.0
numpy>=1.18.0
scikit-learn>=0.22.0
Clone the repository:

git clone https://github.com/yourusername/ai-backup-monitoring.git
cd ai-backup-monitoring

Install dependencies:

pip install -r requirements.txt

Configure the database connection. Edit the connection string in the script:

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your_server_name;"
    "DATABASE=your_database_name;"
    "UID=your_username;"
    "PWD=your_password;"
)

Ensure the backup history table exists. The script expects a table named dbo.BackupMonitoringHistory with the following columns:

- BackupID
- DatabaseName
- BackupStartDate
- BackupFinishDate
- BackupType
- DurationSeconds
- BackupSizeMB
- CompressedBackupSizeMB
- ChecksumStatus
- LastLSN
- FirstLSN
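For the connection step above, one option is to keep credentials out of the script by reading them from environment variables. This is only a sketch: the `BACKUP_MON_*` variable names and the `build_connection_string` helper are hypothetical and not part of the project.

```python
import os


def build_connection_string(env=os.environ):
    """Assemble a pyodbc-style connection string from environment variables.

    The variable names (BACKUP_MON_*) are hypothetical, chosen for this sketch.
    """
    parts = {
        "DRIVER": "{ODBC Driver 17 for SQL Server}",
        "SERVER": env["BACKUP_MON_SERVER"],
        "DATABASE": env["BACKUP_MON_DATABASE"],
        "UID": env["BACKUP_MON_UID"],
        "PWD": env["BACKUP_MON_PWD"],
    }
    return "".join(f"{key}={value};" for key, value in parts.items())


# Example with placeholder values instead of the real environment.
example_env = {
    "BACKUP_MON_SERVER": "your_server_name",
    "BACKUP_MON_DATABASE": "your_database_name",
    "BACKUP_MON_UID": "your_username",
    "BACKUP_MON_PWD": "your_password",
}
print(build_connection_string(example_env))
```

The resulting string can be passed to `pyodbc.connect(...)` in place of the hard-coded literal. A missing variable raises `KeyError`, which surfaces configuration mistakes before any connection attempt.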
python backup_monitoring.py

The script prints a formatted table of detected anomalies with the following columns:
- BackupID: Unique identifier for the backup
- DatabaseName: Name of the database
- BackupType: Type of backup (FULL, DIFF, LOG, etc.)
- BackupStartDate: When the backup started
- AlertPriority: Severity level of the alert
- IO_Throughput_MBPS: Backup speed in MB/s
- ThroughputDeviation: Percentage deviation from baseline
- SecondsPerGB: Performance metric (seconds per GB)
- IO_Risk: I/O bottleneck status
- PredictedFailureRisk: Risk assessment for failure
- SLA_Risk: SLA violation status
- RootCauseHint: Suggested reason for the anomaly
- AnomalyReason: Feature with highest deviation
========== AI BACKUP MONITORING RESULTS ==========
BackupID DatabaseName BackupType BackupStartDate AlertPriority ... RootCauseHint
0 1234 MyDB FULL 2024-01-15 23:00:00 HIGH ... I/O bottleneck: throughput dropped
1 1235 MyDB LOG 2024-01-16 00:15:00 CRITICAL ... No obvious issue
SLA Threshold (line 131)
SLA_SECONDS = 3600  # 1 hour

Anomaly Detection Sensitivity (line 139)
contamination=0.03,  # Expect ~3% of backups to be anomalies

Rolling Window (lines 82-93)
x.rolling(7, min_periods=1).mean()  # rolling mean over the last 7 backups (7 days when backups run daily)

Growth Rate Cap (line 59)
df["GrowthRate_Capped"] = df["GrowthRate"].clip(-5, 5)  # ±5x

To flag only the most extreme outliers (fewer alerts), decrease contamination:
contamination=0.01  # Only ~1% flagged as anomalies

To flag more backups (more sensitive detection), increase contamination:
contamination=0.05  # ~5% flagged as anomalies

- Data Collection: Queries the last 30 days of backup history from SQL Server
- Feature Engineering: Calculates performance metrics and deviations
- Anomaly Detection: Uses Isolation Forest to identify unusual patterns
- Risk Assessment: Evaluates SLA compliance and failure risk
- Root Cause Analysis: Suggests probable causes for detected issues
- Alert Prioritization: Assigns severity based on multiple risk factors
- Results Aggregation: Displays only problematic backups
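The anomaly detection step in the list above can be sketched with scikit-learn's `IsolationForest`. The data below is synthetic, and the two features (duration and throughput) were chosen only for illustration; the real script's feature set is not shown here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic features for 200 typical backups:
# column 0 = duration in seconds, column 1 = throughput in MB/s.
normal = np.column_stack([
    rng.normal(600, 30, size=200),
    rng.normal(100, 5, size=200),
])
# One clearly unusual backup: very slow with very low throughput.
outlier = np.array([[3000.0, 10.0]])
features = np.vstack([normal, outlier])

# contamination=0.03 tells the model to flag roughly 3% of rows.
model = IsolationForest(contamination=0.03, random_state=42)
labels = model.fit_predict(features)  # -1 = anomaly, 1 = normal

print("flagged rows:", int((labels == -1).sum()))
print("outlier flagged:", bool(labels[-1] == -1))
```

Because `contamination` fixes the share of rows that get flagged, some backups are marked as anomalies even when every run is healthy. This is why the later risk assessment and prioritization steps matter.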
The system provides intelligent hints based on detected patterns:
- Compression disabled: Compression ratio below 1
- I/O bottleneck: Throughput dropped while backup size stable
- Storage slowdown: Time per GB increasing significantly
- Possible storage bottleneck: Low throughput with long duration
- Throughput degradation: Significant drop from baseline
- Compression behavior change: Deviation in compression efficiency
- Backup size increase: Rapid database growth detected
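A rule-based hint function along these lines could look like the sketch below. The dictionary keys and every numeric threshold except the compression rule are illustrative assumptions, and only a subset of the hints is covered.

```python
def root_cause_hint(row):
    """Return the first matching hint for a dict of per-backup metrics.

    Keys and numeric thresholds are illustrative assumptions.
    """
    if row["compression_ratio"] < 1:
        return "Compression disabled"
    # Throughput fell sharply while the backup size barely changed.
    if row["throughput_deviation_pct"] < -30 and abs(row["growth_rate"]) < 0.05:
        return "I/O bottleneck: throughput dropped while backup size stable"
    if row["seconds_per_gb_deviation_pct"] > 50:
        return "Storage slowdown: time per GB increasing significantly"
    if row["growth_rate"] > 0.5:
        return "Backup size increase: rapid database growth detected"
    return "No obvious issue"


sample = {
    "compression_ratio": 2.0,
    "throughput_deviation_pct": -45.0,
    "growth_rate": 0.01,
    "seconds_per_gb_deviation_pct": 10.0,
}
print(root_cause_hint(sample))
```

Rule order matters in a design like this: the first matching rule wins, so more specific conditions should be checked before more general ones.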
If your backup history table doesn't exist, create it with:
CREATE TABLE dbo.BackupMonitoringHistory (
BackupID INT PRIMARY KEY,
DatabaseName NVARCHAR(128),
BackupStartDate DATETIME2,
BackupFinishDate DATETIME2,
BackupType NVARCHAR(20),
DurationSeconds INT,
BackupSizeMB BIGINT,
CompressedBackupSizeMB BIGINT,
ChecksumStatus INT,
LastLSN NUMERIC(25,0),
FirstLSN NUMERIC(25,0)
);

Populate it by querying msdb.dbo.backupset and related tables.
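One possible way to populate the table is sketched below as a Python helper wrapping a single `INSERT ... SELECT` against `msdb.dbo.backupset`. Several choices are assumptions rather than project behavior: mapping `has_backup_checksums` to `ChecksumStatus`, the FULL/DIFF/LOG labels for backup types D/I/L, the 30-day window, and the `populate_history` helper itself. The statement has not been tested against a live server.

```python
# Copies recent rows from msdb.dbo.backupset that are not yet in the history table.
POPULATE_SQL = """
INSERT INTO dbo.BackupMonitoringHistory (
    BackupID, DatabaseName, BackupStartDate, BackupFinishDate, BackupType,
    DurationSeconds, BackupSizeMB, CompressedBackupSizeMB, ChecksumStatus,
    LastLSN, FirstLSN
)
SELECT
    bs.backup_set_id,
    bs.database_name,
    bs.backup_start_date,
    bs.backup_finish_date,
    CASE bs.type WHEN 'D' THEN 'FULL' WHEN 'I' THEN 'DIFF'
                 WHEN 'L' THEN 'LOG' ELSE bs.type END,
    DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date),
    CAST(bs.backup_size / 1048576 AS BIGINT),             -- bytes to MB
    CAST(bs.compressed_backup_size / 1048576 AS BIGINT),  -- bytes to MB
    CAST(bs.has_backup_checksums AS INT),
    bs.last_lsn,
    bs.first_lsn
FROM msdb.dbo.backupset AS bs
WHERE bs.backup_start_date >= DATEADD(DAY, -30, GETDATE())
  AND NOT EXISTS (
      SELECT 1 FROM dbo.BackupMonitoringHistory AS h
      WHERE h.BackupID = bs.backup_set_id
  );
"""


def populate_history(conn):
    """Run the insert on an open DB-API connection (e.g. from pyodbc.connect)."""
    cursor = conn.cursor()
    cursor.execute(POPULATE_SQL)
    conn.commit()
```

The `NOT EXISTS` filter makes the statement safe to rerun on a schedule, since rows already copied are skipped.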
On Windows (for example, as a Task Scheduler action):
python C:\path\to\backup_monitoring.py >> C:\logs\backup_monitoring.log

On Linux (cron, daily at 01:00):
0 1 * * * python /path/to/backup_monitoring.py >> /var/log/backup_monitoring.log

Pipe output to a logging system or email alerting service for critical alerts.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/improvement)
- Commit your changes (git commit -am 'Add improvement')
- Push to the branch (git push origin feature/improvement)
- Open a Pull Request
- Email alerting for critical backups
- Dashboard visualization with historical trends
- Database-specific SLA configuration
- Integration with monitoring platforms (Grafana, Datadog, etc.)
- Support for other SQL Server versions
- Configuration file support (JSON/YAML)
This project is licensed under the MIT License - see the LICENSE file for details.
Connection Error: "ODBC Driver 17 not found"
- Install ODBC Driver 17 for SQL Server from Microsoft
No anomalies detected
- This is normal! It means backups are running smoothly
- Increase the contamination parameter to detect more subtle issues
- Check that the backup history table is populated with recent data
High false positive rate
- Decrease the contamination value (e.g., from 0.03 to 0.01) so that fewer backups are flagged
- Review and adjust threshold values for your environment
Memory issues with large datasets
- Reduce the time window in the SQL query (e.g., -14 instead of -30 days)
- Process databases individually
For issues, questions, or feature requests, please open an issue on GitHub.
- Scikit-learn for machine learning capabilities
- Pandas for data manipulation
- SQL Server community for feedback and use cases