# Training-time Monitoring

This notebook covers essential practices for monitoring your PyTorch models during the training phase.

## Topics Covered

1. **Experiment Tracking**
   - Weights & Biases (WandB)
   - MLflow
   - Neptune

2. **Logging Best Practices**
   - Configuration logging
   - Dataset versions
   - Metrics tracking
   - Artifacts management
   - System metrics (GPU/CPU/memory)

3. **Data Validation**
   - Great Expectations
   - Schema checks
   - Label distribution checks

4. **Performance Profiling**
   - torch.profiler
   - TensorBoard traces
   - DataLoader bottlenecks


## 1. Experiment Tracking with Weights & Biases


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import wandb
import os
from datetime import datetime

# Initialize wandb
wandb.init(
    project="pytorch-monitoring-demo",
    config={
        "learning_rate": 0.001,
        "epochs": 10,
        "batch_size": 32,
        "architecture": "CNN",
        "dataset": "CIFAR-10"
    }
)
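
Once a run is initialized, per-epoch metrics are typically sent to it with `wandb.log`. The sketch below shows one way to structure that loop. It keeps the logging call injectable so it runs without a W&B account: `run_epochs`, `fake_epoch`, and `log_fn` are hypothetical names introduced here, and the loss values are placeholders rather than real results.

```python
from typing import Callable, Dict, List, Tuple


def run_epochs(
    epochs: int,
    train_one_epoch: Callable[[int], Tuple[float, float]],
    log_fn: Callable[..., None],
) -> List[Dict[str, float]]:
    """Run `epochs` epochs and send one metrics dict per epoch to `log_fn`."""
    history = []
    for epoch in range(1, epochs + 1):
        train_loss, val_loss = train_one_epoch(epoch)
        # Slash-separated keys group related charts in most tracking UIs.
        metrics = {"train/loss": train_loss, "val/loss": val_loss}
        # With W&B this call would be: wandb.log(metrics, step=epoch)
        log_fn(metrics, step=epoch)
        history.append(metrics)
    return history


def fake_epoch(epoch: int) -> Tuple[float, float]:
    """Stand-in for a real training and validation pass (illustrative values)."""
    return 1.0 / epoch, 1.2 / epoch


# Capture what would have been logged, so the sketch runs offline.
logged = []
history = run_epochs(
    epochs=3,
    train_one_epoch=fake_epoch,
    log_fn=lambda metrics, step: logged.append((step, metrics)),
)
```

In the notebook, passing `log_fn=wandb.log` and a real `train_one_epoch` would send the same dictionaries to the run created above.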


## 2. System Metrics Logging


In [None]:
import psutil
import GPUtil

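def log_system_metrics():
    """Collect CPU, memory, and (if available) GPU metrics as a flat dict.

    The metrics are returned rather than sent anywhere, so the caller can
    print them or pass them to a tracker such as wandb.log().
    """
    # CPU metrics. Note: interval=1 blocks for about one second while sampling.
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()

    # GPU metrics (first GPU only, if available)
    gpu_metrics = {}
    if torch.cuda.is_available():
        gpus = GPUtil.getGPUs()
        if gpus:
            gpu = gpus[0]
            gpu_metrics = {
                # gpu.load is a fraction in [0, 1]; convert to percent
                "system/gpu_utilization": gpu.load * 100,
                # GPUtil reports memory in MB and temperature in Celsius
                "system/gpu_memory_used": gpu.memoryUsed,
                "system/gpu_memory_total": gpu.memoryTotal,
                "system/gpu_temperature": gpu.temperature
            }

    metrics = {
        "system/cpu_percent": cpu_percent,
        "system/memory_percent": memory.percent,
        "system/memory_used_gb": memory.used / (1024**3),
        **gpu_metrics
    }

    return metrics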

# Example usage
system_metrics = log_system_metrics()
print("System Metrics:", system_metrics)
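
Because `psutil.cpu_percent(interval=1)` blocks for roughly one second, calling `log_system_metrics()` on every training step would slow the loop noticeably. One possible approach is to throttle sampling to a fixed minimum interval. The following is a minimal sketch under that assumption: `ThrottledSampler`, `maybe_collect`, and the injected `clock` are hypothetical names, and the demo uses a fake clock and a stub collector so it runs without `psutil` or a GPU.

```python
import time
from typing import Callable, Dict, Optional


class ThrottledSampler:
    """Call `collect_fn` at most once every `min_interval_s` seconds."""

    def __init__(
        self,
        collect_fn: Callable[[], Dict[str, float]],
        min_interval_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.collect_fn = collect_fn
        self.min_interval_s = min_interval_s
        self.clock = clock
        self._last: Optional[float] = None  # time of the last collection

    def maybe_collect(self) -> Optional[Dict[str, float]]:
        """Return fresh metrics, or None if the interval has not elapsed."""
        now = self.clock()
        if self._last is not None and now - self._last < self.min_interval_s:
            return None
        self._last = now
        return self.collect_fn()


# Demo with a controllable clock and a stub collector (values are made up).
fake_time = [0.0]
sampler = ThrottledSampler(
    collect_fn=lambda: {"system/cpu_percent": 12.5},
    min_interval_s=10.0,
    clock=lambda: fake_time[0],
)

first = sampler.maybe_collect()   # first call always collects
fake_time[0] = 5.0
second = sampler.maybe_collect()  # only 5 s elapsed, so this is skipped
fake_time[0] = 10.0
third = sampler.maybe_collect()   # 10 s elapsed, so this collects again
```

In a training loop, the sampler would wrap `log_system_metrics` (for example `ThrottledSampler(log_system_metrics, min_interval_s=30.0)`), and any non-`None` result would be merged into the metrics logged for that step.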
