# 🚀 NVIDIA DGX Spark Optimized HVAC Training Pipeline

## YOLOplan + YOLO11 + Roboflow on DGX Infrastructure

### 🎯 Purpose
This notebook is specifically optimized for **NVIDIA DGX Spark** infrastructure with multi-GPU support (A100/H100). It provides:
- Optimal GPU utilization without overloading Spark resources
- High-performance training with mixed precision and distributed support
- Enterprise-grade model training for HVAC blueprint detection

### 🖥️ Hardware Requirements
- **Platform**: NVIDIA DGX Station/Server with Spark
- **GPU**: 1-8x NVIDIA A100 (40/80GB) or H100 (80GB)
- **Memory**: 256GB+ system RAM recommended
- **Storage**: High-speed NVMe for dataset storage

### ⚙️ Optimizations
- **Multi-GPU Support**: Efficient distribution across available GPUs
- **Memory Management**: Controlled resource usage to prevent Spark overload
- **Data Loading**: Optimized for NVMe storage with parallel workers
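- **Mixed Precision**: Automatic mixed precision (FP16) training, which can give up to a 2-3x speedup on Tensor Core GPUs; the actual gain depends on model, image size, and batch size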
- **TensorRT Ready**: Models can be exported to TensorRT for inference

### 📋 Prerequisites
```bash
# Confirm the NVIDIA driver is loaded and the GPUs are visible
nvidia-smi

# Python environment with required packages
pip install ultralytics roboflow pyyaml pandas matplotlib seaborn tensorboard tqdm
```

### 🔧 Configuration Notes
- **Batch Size**: Chosen from per-GPU presets (A100, H100, V100, or a conservative default) and multiplied by the number of GPUs in use
- **Workers**: CPU count / GPU count, clamped to 2-4 data-loader workers per GPU
- **Storage Paths**: Uses local NVMe paths (no cloud storage)
- **Monitoring**: TensorBoard for real-time metrics
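
The batch-size and worker rules above can be sketched as a small pure-Python helper. This is only an illustration of the arithmetic used later in the notebook; the function name `plan_resources` and the `max_gpus` parameter are hypothetical and not part of any library.

```python
def plan_resources(gpu_count, cpu_count, batch_per_gpu, max_gpus=4):
    """Return (gpus_to_use, total_batch, workers_per_gpu).

    Mirrors the notebook's conservative caps: at most `max_gpus` GPUs,
    and between 2 and 4 data-loader workers per GPU.
    """
    if gpu_count < 1:
        raise ValueError("at least one GPU is required")
    gpus_to_use = min(gpu_count, max_gpus)
    total_batch = batch_per_gpu * gpus_to_use
    workers_per_gpu = max(2, min(4, cpu_count // gpus_to_use))
    return gpus_to_use, total_batch, workers_per_gpu


# Example: an 8-GPU node with 128 CPU cores and a preset of 16 images per GPU
print(plan_resources(gpu_count=8, cpu_count=128, batch_per_gpu=16))
```

With these inputs only 4 of the 8 GPUs are used, giving a total batch of 64 and 4 workers per GPU.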


In [None]:
# --- ENVIRONMENT VERIFICATION ---
import os
import sys
import torch
import subprocess

print("="*70)
print("🖥️  DGX SPARK ENVIRONMENT CHECK")
print("="*70)

# Check CUDA availability
if torch.cuda.is_available():
    gpu_count = torch.cuda.device_count()
    print(f"✅ CUDA Available: {torch.version.cuda}")
    print(f"✅ PyTorch Version: {torch.__version__}")
    print(f"✅ GPU Count: {gpu_count}")
    print("\n📊 GPU Details:")
    for i in range(gpu_count):
        props = torch.cuda.get_device_properties(i)
        print(f"   GPU {i}: {props.name}")
        print(f"      Memory: {props.total_memory / 1e9:.2f} GB")
        print(f"      Compute Capability: {props.major}.{props.minor}")
else:
    print("❌ ERROR: No CUDA-capable GPU detected!")
    print("   This notebook requires NVIDIA GPU support.")
    sys.exit(1)

# Check cuDNN
if torch.backends.cudnn.is_available():
    print(f"\n✅ cuDNN Available: {torch.backends.cudnn.version()}")
    print(f"   cuDNN Enabled: {torch.backends.cudnn.enabled}")

# System info
print("\n💻 System Resources:")
try:
    cpu_count = os.cpu_count()
    print(f"   CPU Cores: {cpu_count}")
    
    # Get memory info on Linux
    with open('/proc/meminfo', 'r') as f:
        meminfo = f.read()
        for line in meminfo.split('\n'):
            if 'MemTotal' in line:
                mem_total = int(line.split()[1]) / 1e6
                print(f"   System RAM: {mem_total:.1f} GB")
                break
except (OSError, ValueError, IndexError):
    print("   (System info unavailable)")

# Check storage
print("\n💾 Storage Check:")
workspace_dir = os.path.expanduser('~/hvac_workspace')
if not os.path.exists(workspace_dir):
    os.makedirs(workspace_dir, exist_ok=True)
    print(f"   Created workspace: {workspace_dir}")
else:
    print(f"   Workspace exists: {workspace_dir}")

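# Set environment variables for performance
# '0' (async kernel launches) is already the default; the variable is read when
# CUDA initializes, so setting it here mainly guards against an inherited '1'.
os.environ['CUDA_LAUNCH_BLOCKING'] = '0'
# Device-side assertions are primarily a PyTorch build-time option, so this
# runtime setting may have no effect on stock builds.
os.environ['TORCH_USE_CUDA_DSA'] = '1'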

print("\n" + "="*70)
print("✅ ENVIRONMENT CHECK COMPLETE")
print("="*70)
print(f"\n📁 Working Directory: {workspace_dir}")
print("\n🚀 Ready to proceed with training!\n")


In [None]:
# --- STEP 1: ENVIRONMENT SETUP FOR DGX ---
import os
import sys
import subprocess

print("="*70)
print("📦 INSTALLING DEPENDENCIES")
print("="*70)

# Set workspace directory
WORKSPACE_DIR = os.path.expanduser('~/hvac_workspace')
os.makedirs(WORKSPACE_DIR, exist_ok=True)
os.chdir(WORKSPACE_DIR)

print(f"\n📁 Workspace: {WORKSPACE_DIR}")

# Install/upgrade required packages
packages = [
    'ultralytics',  # YOLO11
    'roboflow',     # Dataset management
    'pyyaml',       # Configuration
    'pandas',       # Data analysis
    'matplotlib',   # Plotting
    'seaborn',      # Visualization
    'tensorboard',  # Monitoring
    'tqdm',         # Progress bars
]

print("\n🔧 Installing/upgrading packages...")
for pkg in packages:
    print(f"   - {pkg}")
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', '--upgrade', pkg], 
                   check=False)

print("\n✅ Package installation complete")

# Clone YOLOplan repository if not exists
yoloplan_dir = os.path.join(WORKSPACE_DIR, 'YOLOplan')
if not os.path.exists(yoloplan_dir):
    print("\n📥 Cloning YOLOplan repository...")
    subprocess.run(['git', 'clone', 'https://github.com/cvicream/YOLOplan.git', yoloplan_dir],
                   check=False, capture_output=True)
    if os.path.exists(yoloplan_dir):
        print(f"   ✅ Cloned to: {yoloplan_dir}")
    else:
        print("   ⚠️ YOLOplan clone failed (check network access and the repository URL)")
else:
    print(f"\n✅ YOLOplan already exists: {yoloplan_dir}")

# Add YOLOplan to Python path
if yoloplan_dir not in sys.path:
    sys.path.insert(0, yoloplan_dir)

# Verify installations
print("\n🔍 Verifying installations...")
try:
    from ultralytics import YOLO
    import roboflow
    import yaml
    print("   ✅ All required packages imported successfully")
except ImportError as e:
    print(f"   ❌ Import error: {e}")
    sys.exit(1)

print("\n" + "="*70)
print("✅ ENVIRONMENT SETUP COMPLETE")
print("="*70)
print(f"\n📂 Working in: {os.getcwd()}")


In [None]:
# --- STEP 2: SECURE DATA DOWNLOAD ---
from roboflow import Roboflow
import os
import getpass

print("="*70)
print("📥 ROBOFLOW DATASET DOWNLOAD")
print("="*70)

# Get Roboflow credentials
# Priority: 1) Environment variables, 2) User input
RF_API_KEY = os.environ.get('ROBOFLOW_API_KEY')
RF_WORKSPACE = os.environ.get('ROBOFLOW_WORKSPACE', 'hvac-detection')
RF_PROJECT = os.environ.get('ROBOFLOW_PROJECT', 'hvac-blueprint-analysis')
RF_VERSION = os.environ.get('ROBOFLOW_VERSION', '1')

if not RF_API_KEY:
    print("⚠️ ROBOFLOW_API_KEY not found in environment variables")
    print("\nYou can set it in your environment:")
    print("   export ROBOFLOW_API_KEY='your-api-key'")
    print("\nOr enter it now (input will be hidden):")
    RF_API_KEY = getpass.getpass("Roboflow API Key: ")

if not RF_API_KEY:
    print("\n❌ ERROR: Roboflow API key is required")
    raise ValueError("Missing Roboflow API key")

print(f"\n🔐 Using Roboflow credentials:")
print(f"   Workspace: {RF_WORKSPACE}")
print(f"   Project: {RF_PROJECT}")
print(f"   Version: {RF_VERSION}")

# Set download location (local NVMe storage)
DATASET_DIR = os.path.join(os.getcwd(), 'dataset')
os.makedirs(DATASET_DIR, exist_ok=True)

print(f"\n📂 Download location: {DATASET_DIR}")

# Download dataset
try:
    print("\n⬇️ Downloading dataset from Roboflow...")
    rf = Roboflow(api_key=RF_API_KEY)
    project = rf.workspace(RF_WORKSPACE).project(RF_PROJECT)
    dataset = project.version(RF_VERSION).download(
        model_format="coco-segmentation",
        location=DATASET_DIR,
        overwrite=False  # Skip if already downloaded
    )
    
    DATASET_PATH = dataset.location
    print(f"\n✅ Dataset downloaded successfully")
    print(f"   Location: {DATASET_PATH}")
    
    # Verify dataset structure
    print("\n📊 Dataset structure:")
    for split in ['train', 'valid', 'test']:
        split_dir = os.path.join(DATASET_PATH, split)
        if os.path.exists(split_dir):
            img_count = len([f for f in os.listdir(split_dir) if f.endswith(('.jpg', '.png', '.jpeg'))])
            print(f"   {split}: {img_count} images")
    
    # Check for annotations
    annotations_file = os.path.join(DATASET_PATH, 'train', '_annotations.coco.json')
    if os.path.exists(annotations_file):
        import json
        with open(annotations_file, 'r') as f:
            coco_data = json.load(f)
        print(f"\n📝 Annotations:")
        print(f"   Images: {len(coco_data.get('images', []))}")
        print(f"   Annotations: {len(coco_data.get('annotations', []))}")
        print(f"   Categories: {len(coco_data.get('categories', []))}")
        
        # List categories
        if coco_data.get('categories'):
            print("\n🏷️ Classes:")
            for cat in coco_data['categories'][:10]:  # Show first 10
                print(f"      {cat['id']}: {cat['name']}")
            if len(coco_data['categories']) > 10:
                print(f"      ... and {len(coco_data['categories']) - 10} more")
    
except Exception as e:
    print(f"\n❌ ERROR downloading dataset: {e}")
    raise

print("\n" + "="*70)
print("✅ DATA DOWNLOAD COMPLETE")
print("="*70)

# Store for next cells
globals()['DATASET_PATH'] = DATASET_PATH
globals()['RF_VERSION'] = RF_VERSION
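
The structure check in the cell above matches lowercase extensions only and looks directly inside each split folder, so it would report 0 images once a later step has moved them into an `images/` subfolder. A minimal sketch of a more tolerant counter follows; `count_images` and `IMAGE_SUFFIXES` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the dataset export.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images(split_dir):
    """Count image files in split_dir and in its images/ subfolder, if present."""
    root = Path(split_dir)
    total = 0
    for folder in (root, root / "images"):
        if folder.is_dir():
            total += sum(
                1
                for p in folder.iterdir()
                if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
            )
    return total
```

Calling `count_images(os.path.join(DATASET_PATH, 'train'))` gives the same answer before and after the images are reorganized.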


In [None]:
# --- STEP 2.5: AGGRESSIVE DATASET REPAIR & VALIDATION ---
import os
import glob
import shutil
import json
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from collections import defaultdict

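# Use the dataset downloaded in Step 2 (falls back to the default workspace location)
DATASET_ROOT = globals().get('DATASET_PATH', os.path.expanduser('~/hvac_workspace/dataset'))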
SPLITS = ['train', 'valid', 'test']

print("="*60)
print("🩺 RUNNING AGGRESSIVE DATASET REPAIR & VALIDATION...")
print("="*60)

dataset_stats = {}

for split in SPLITS:
    split_path = os.path.join(DATASET_ROOT, split)
    json_path = os.path.join(split_path, "_annotations.coco.json")
    images_dir = os.path.join(split_path, "images")

    if not os.path.isdir(split_path): 
        continue

    print(f"\n--- Processing '{split}' split ---")

    # 1. ENFORCE FOLDER STRUCTURE
    if not os.path.isdir(images_dir):
        os.makedirs(images_dir, exist_ok=True)
        image_files = glob.glob(os.path.join(split_path, '*.*'))
        image_files = [f for f in image_files if f.lower().endswith(('.jpg', '.jpeg', '.png', '.bmp'))]

        if image_files:
            print(f"   Moving {len(image_files)} images to '{images_dir}'...")
            for img in tqdm(image_files, leave=False):
                try:
                    shutil.move(img, images_dir)
                except shutil.Error:
                    pass

    # 2. REPAIR JSON FILENAMES & COLLECT STATISTICS
    if os.path.exists(json_path):
        print(f"   Repairing JSON metadata in: {os.path.basename(json_path)}")
        try:
            with open(json_path, 'r') as f:
                data = json.load(f)

            modified = False
            valid_images = []
            class_counts = defaultdict(int)
            files_on_disk = set(os.listdir(images_dir))

            for img_entry in data['images']:
                original_name = img_entry['file_name']
                clean_name = os.path.basename(original_name)

                if original_name != clean_name:
                    img_entry['file_name'] = clean_name
                    modified = True

                if clean_name in files_on_disk:
                    valid_images.append(img_entry)

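            # Drop annotations whose image is missing, so Step 3 never sees an unknown image_id
            valid_ids = {img['id'] for img in valid_images}
            kept_annotations = [a for a in data['annotations'] if a['image_id'] in valid_ids]
            if len(kept_annotations) < len(data['annotations']):
                print(f"      ⚠️ Removed {len(data['annotations']) - len(kept_annotations)} annotations that pointed at missing images.")
                data['annotations'] = kept_annotations
                modified = True

            # Count annotations per class (one dict lookup per annotation)
            cat_names = {c['id']: c['name'] for c in data['categories']}
            for ann in data['annotations']:
                class_counts[cat_names.get(ann['category_id'], 'unknown')] += 1

            if len(valid_images) < len(data['images']):
                print(f"      ⚠️ Removed {len(data['images']) - len(valid_images)} entries from JSON that had no matching image file.")
                data['images'] = valid_images
                modified = True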

            if modified:
                with open(json_path, 'w') as f:
                    json.dump(data, f)
                print("      ✅ JSON fixed: Removed path prefixes and synced with disk.")
            else:
                print("      ✅ JSON was already correct.")

            # Store statistics
            dataset_stats[split] = {
                'num_images': len(valid_images),
                'num_annotations': len(data['annotations']),
                'num_classes': len(data['categories']),
                'class_counts': dict(class_counts),
                'categories': {c['id']: c['name'] for c in data['categories']}
            }

            print(f"      📊 Statistics: {len(valid_images)} images, {len(data['annotations'])} annotations")

        except Exception as e:
            print(f"      ❌ Failed to parse JSON: {e}")
    else:
        print(f"      ❌ ERROR: Annotation file missing: {json_path}")

print("\n" + "="*60)
print("🎉 REPAIR COMPLETE. Dataset ready for training.")
print("="*60)

# 3. VISUALIZE DATASET STATISTICS
if dataset_stats:
    print("\n📊 DATASET STATISTICS:")
    for split, stats in dataset_stats.items():
        print(f"\n{split.upper()}:")
        print(f"  Images: {stats['num_images']}")
        print(f"  Annotations: {stats['num_annotations']}")
        print(f"  Classes: {stats['num_classes']}")
        print(f"  Avg annotations per image: {stats['num_annotations']/max(stats['num_images'],1):.2f}")
    
    # Plot class distribution for training set
    if dataset_stats.get('train', {}).get('class_counts'):  # Skip the plot when there are no annotations
        fig, ax = plt.subplots(figsize=(12, 6))
        class_counts = dataset_stats['train']['class_counts']
        sorted_classes = sorted(class_counts.items(), key=lambda x: x[1], reverse=True)
        classes, counts = zip(*sorted_classes)
        
        ax.bar(range(len(classes)), counts)
        ax.set_xticks(range(len(classes)))
        ax.set_xticklabels(classes, rotation=45, ha='right')
        ax.set_xlabel('Class')
        ax.set_ylabel('Number of Annotations')
        ax.set_title('Training Set: Class Distribution')
        plt.tight_layout()
        plt.show()
        
        # Check for class imbalance
        max_count = max(counts)
        min_count = min(counts)
        if max_count / min_count > 10:
            print(f"\n⚠️ WARNING: Class imbalance detected! Ratio: {max_count/min_count:.1f}x")
            print(f"   Consider using weighted loss or copy_paste augmentation.")
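
The imbalance check at the end of the cell above divides the largest class count by the smallest. The same idea as a standalone helper, sketched here with hypothetical class names and a hypothetical function name `imbalance_ratio`, makes the empty-dataset case explicit:

```python
def imbalance_ratio(class_counts):
    """Return max/min annotation count over non-empty classes, or None if there are none."""
    counts = [c for c in class_counts.values() if c > 0]
    if not counts:
        return None
    return max(counts) / min(counts)


# Hypothetical counts for three HVAC symbol classes
example_counts = {"duct": 500, "diffuser": 100, "vav_box": 25}
ratio = imbalance_ratio(example_counts)
if ratio is not None and ratio > 10:
    print(f"Class imbalance detected: {ratio:.1f}x")
```

The threshold of 10 is the one the notebook already uses; a ratio above it suggests weighted loss or `copy_paste` augmentation, as the warning says.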

In [None]:
# --- STEP 3: CONVERT COCO TO YOLO TXT & GENERATE CONFIG ---
import os
import glob
import shutil
import json
import yaml
from tqdm.notebook import tqdm

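# Use the dataset downloaded in Step 2 (falls back to the default workspace location)
DATASET_ROOT = globals().get('DATASET_PATH', os.path.expanduser('~/hvac_workspace/dataset'))
SPLITS = ['train', 'valid', 'test']
# Must match data_yaml_path in the training configuration cell
OUTPUT_YAML_PATH = os.path.expanduser('~/hvac_workspace/hvac_config.yaml')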

print("⚙️ CONVERTING COCO JSON TO YOLO TXT FORMAT...")

def convert_coco_to_yolo(json_path, output_labels_dir):
    """Convert COCO annotations to YOLO segmentation labels; return the number of label files written."""
    if not os.path.exists(json_path):
        return 0

    with open(json_path, 'r') as f:
        data = json.load(f)

    images = {img['id']: img for img in data['images']}
    categories = {cat['id']: idx for idx, cat in enumerate(sorted(data['categories'], key=lambda x: x['id']))}

    # Group annotations by image
    img_annotations = {}
    for ann in data['annotations']:
        img_id = ann['image_id']
        if img_id not in img_annotations: 
            img_annotations[img_id] = []
        img_annotations[img_id].append(ann)

    # Write TXT files
    os.makedirs(output_labels_dir, exist_ok=True)
    count = 0

    for img_id, anns in img_annotations.items():
        img_info = images.get(img_id)
        if img_info is None:
            continue  # Annotation references an image that is not listed in the JSON
        img_w, img_h = img_info['width'], img_info['height']
        filename = os.path.basename(img_info['file_name'])
        txt_name = os.path.splitext(filename)[0] + ".txt"

        with open(os.path.join(output_labels_dir, txt_name), 'w') as f:
            for ann in anns:
                cat_id = categories[ann['category_id']]

                # Convert polygon to normalized coordinates
                # YOLO seg format: <class> <x1> <y1> <x2> <y2> ...
                # Only the first polygon is used; RLE masks and polygons with
                # fewer than 3 points are skipped
                seg = ann.get('segmentation')
                if (not isinstance(seg, list) or not seg
                        or not isinstance(seg[0], list) or len(seg[0]) < 6):
                    continue
                segmentation = seg[0]
                normalized_points = []
                for i in range(0, len(segmentation) - 1, 2):
                    x = segmentation[i] / img_w
                    y = segmentation[i+1] / img_h
                    normalized_points.append(f"{x:.6f} {y:.6f}")

                f.write(f"{cat_id} " + " ".join(normalized_points) + "\n")
        count += 1
    return count

# 1. EXECUTE CONVERSION
for split in SPLITS:
    split_path = os.path.join(DATASET_ROOT, split)
    json_path = os.path.join(split_path, "_annotations.coco.json")
    labels_dir = os.path.join(split_path, "labels")
    images_dir = os.path.join(split_path, "images")

    # Move images if needed (Sanitization)
    if not os.path.isdir(images_dir):
        os.makedirs(images_dir, exist_ok=True)
        files = glob.glob(os.path.join(split_path, '*.*'))
        files = [f for f in files if f.lower().endswith(('.jpg', '.png', '.jpeg'))]
        for f in files: 
            shutil.move(f, images_dir)

    # Convert Labels
    if os.path.exists(json_path):
        num_converted = convert_coco_to_yolo(json_path, labels_dir)
        print(f"   ✅ Converted {num_converted} labels for '{split}'")
    else:
        print(f"   ⚠️ No JSON found for '{split}'")

# 2. GENERATE CONFIG
print("⚙️ GENERATING LOCAL CONFIG...")
try:
    # Get class names from train JSON
    with open(os.path.join(DATASET_ROOT, "train", "_annotations.coco.json"), 'r') as f:
        coco_data = json.load(f)
    class_names = [cat['name'] for cat in sorted(coco_data['categories'], key=lambda x: x['id'])]

    config = {
        'path': DATASET_ROOT,
        'train': "train/images",
        'val': "valid/images",
        'test': "test/images",
        'nc': len(class_names),
        'names': class_names
    }
    with open(OUTPUT_YAML_PATH, 'w') as f:
        yaml.dump(config, f, sort_keys=False)
    print(f"✅ Config saved to: {OUTPUT_YAML_PATH}")
    print(f"   Classes: {len(class_names)}")
    print(f"   First 5 classes: {class_names[:5]}")

except Exception as e:
    print(f"❌ Config Generation Failed: {e}")
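
To make the label format concrete, here is a minimal sketch of the per-polygon conversion performed in the cell above: each pixel coordinate is divided by the image width or height, and the result is written as one line starting with the class index. The helper name `polygon_to_yolo_line` is hypothetical, and the clamping to [0, 1] is an extra safeguard that the notebook's converter does not apply.

```python
def polygon_to_yolo_line(class_index, polygon, img_w, img_h):
    """Format a flat COCO polygon [x1, y1, x2, y2, ...] as a YOLO segmentation label line."""
    if len(polygon) < 6 or len(polygon) % 2:
        raise ValueError("polygon needs at least 3 complete (x, y) pairs")
    coords = []
    for x, y in zip(polygon[0::2], polygon[1::2]):
        # Clamp to [0, 1] in case a vertex sits slightly outside the image (assumption).
        coords.append(f"{min(max(x / img_w, 0.0), 1.0):.6f}")
        coords.append(f"{min(max(y / img_h, 0.0), 1.0):.6f}")
    return f"{class_index} " + " ".join(coords)


# A triangle in a 1000x500 image, class index 2
print(polygon_to_yolo_line(2, [100, 50, 300, 50, 300, 200], img_w=1000, img_h=500))
# → 2 0.100000 0.100000 0.300000 0.100000 0.300000 0.400000
```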

In [None]:
# --- STEP 3.5: DGX SPARK OPTIMIZED TRAINING CONFIGURATION ---
import yaml
import torch
import os
from datetime import datetime

print("="*70)
print("⚙️  CREATING DGX SPARK OPTIMIZED CONFIGURATION")
print("="*70)

# Detect GPU configuration
gpu_count = torch.cuda.device_count()
gpu_name = torch.cuda.get_device_name(0) if gpu_count > 0 else 'unknown'
gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / 1e9 if gpu_count > 0 else 0

print(f"\n🖥️ Detected Hardware:")
print(f"   GPUs: {gpu_count}x {gpu_name}")
print(f"   Memory per GPU: {gpu_memory_gb:.1f} GB")

# Determine optimal settings based on GPU
# These are conservative to avoid overloading Spark
if 'A100' in gpu_name:
    if gpu_memory_gb >= 80:
        # A100 80GB
        batch_per_gpu = 16
        imgsz = 1280
    else:
        # A100 40GB
        batch_per_gpu = 12
        imgsz = 1024
elif 'H100' in gpu_name:
    # H100 80GB
    batch_per_gpu = 20
    imgsz = 1280
elif 'V100' in gpu_name:
    # V100 32GB
    batch_per_gpu = 8
    imgsz = 1024
else:
    # Conservative defaults
    batch_per_gpu = 4
    imgsz = 1024

# Calculate total batch size (conservative for Spark)
# Use max 4 GPUs to avoid overloading Spark cluster
gpus_to_use = min(gpu_count, 4)
if gpus_to_use == 0:
    raise RuntimeError("No CUDA GPU detected; this notebook requires at least one GPU.")
total_batch = batch_per_gpu * gpus_to_use

# Calculate optimal workers (conservative for Spark)
cpu_count = os.cpu_count() or 8
workers_per_gpu = max(2, min(4, cpu_count // gpus_to_use))
total_workers = workers_per_gpu * gpus_to_use

print(f"\n🎯 Optimized Settings:")
print(f"   GPUs to use: {gpus_to_use} of {gpu_count} available")
print(f"   Batch size: {batch_per_gpu} per GPU (total: {total_batch})")
print(f"   Image size: {imgsz}")
print(f"   Workers: {workers_per_gpu} per GPU (total: {total_workers})")
print(f"   ⚠️ Resource usage limited to avoid Spark overload")

# Set paths for DGX local storage
workspace_dir = os.path.expanduser('~/hvac_workspace')
data_yaml_path = os.path.join(workspace_dir, 'hvac_config.yaml')
project_dir = os.path.join(workspace_dir, 'runs', 'segment')
run_name = f'hvac_yolo11_dgx_{datetime.now().strftime("%Y%m%d_%H%M")}'
# Create comprehensive training configuration optimized for DGX Spark
training_config = {
    'metadata': {
        'created_at': datetime.now().isoformat(),
        'platform': 'NVIDIA DGX Spark',
        'gpu_model': gpu_name,
        'gpu_count': gpus_to_use,
        'dataset_version': globals().get('RF_VERSION', '1'),
        'description': 'DGX Spark optimized HVAC YOLO11 segmentation training'
    },
    
    'paths': {
        'data_yaml': data_yaml_path,
        'project_dir': project_dir,
        'run_name': run_name
    },
    
    'model': {
        'architecture': 'yolo11m-seg.pt',  # Medium model for balance
        'pretrained': True,
        'freeze_layers': None  # No freezing for full fine-tuning
    },
    
    'hardware': {
        'imgsz': imgsz,
        'batch': total_batch,
        'workers': workers_per_gpu,  # Ultralytics applies `workers` per GPU (per rank) in multi-GPU runs
        'cache': False,  # Use disk to save RAM for Spark
        'amp': True,     # Mixed precision (FP16); speedup varies by model and GPU
        'device': list(range(gpus_to_use)),  # Use multiple GPUs
        'cudnn_benchmark': True,  # Enable cuDNN autotuner
    },
    
    'training': {
        'epochs': 150,       # More epochs for DGX (faster training)
        'patience': 30,      # Increased patience
        'save_period': 10,   # Save checkpoints every 10 epochs
        'close_mosaic': 20,  # Disable mosaic in last 20 epochs
        'optimizer': 'AdamW',  # AdamW for better convergence
        'lr0': 0.001,          # Initial learning rate
        'lrf': 0.01,           # Final learning rate multiplier
        'momentum': 0.937,
        'weight_decay': 0.0005,
        'warmup_epochs': 5.0,  # Longer warmup for stability
        'warmup_momentum': 0.8,
        'warmup_bias_lr': 0.1,
        'cos_lr': True,        # Cosine learning rate schedule
    },
    
    'augmentation': {
        'augment': True,
        
        # Geometric augmentations (optimized for HVAC blueprints)
        'mosaic': 1.0,        # Full mosaic for context
        'mixup': 0.0,         # Disabled for technical drawings
        'copy_paste': 0.4,    # Increased for small object density
        'degrees': 15.0,      # Rotation for scan variations
        'translate': 0.1,
        'scale': 0.6,         # Scale variation
        'shear': 0.0,         # No shear for technical drawings
        'perspective': 0.0,   # No perspective warp
        'fliplr': 0.5,        # Horizontal flip
        'flipud': 0.5,        # Vertical flip
        
        # Color augmentations
        'hsv_h': 0.015,       # Minimal hue variation
        'hsv_s': 0.7,         # Saturation (faded ink simulation)
        'hsv_v': 0.4,         # Brightness (dark scans)
        
        # Advanced augmentation (recorded for reference only; not passed to model.train() in Step 4)
        'use_albumentations': True,
        'albumentations_p': 0.6,  # Higher probability on DGX
    },
    
    'loss_weights': {
        'box': 7.5,
        'cls': 0.5,
        'dfl': 1.5,
        'seg': 1.0,  # Segmentation loss
    },
    
    'validation': {
        'val': True,
        'plots': True,
        'save_json': True,
        'save_hybrid': True,
        'conf': 0.001,        # Low confidence for validation
        'iou': 0.6,
        'max_det': 300,
    },
    
    'logging': {
        'verbose': True,
        'tensorboard': True,
        'exist_ok': True,
    },
    
    # Recorded for reference only; these keys are not forwarded to model.train() in Step 4
    'dgx_optimizations': {
        'pin_memory': True,           # Pin memory for faster GPU transfer
        'persistent_workers': True,   # Keep workers alive between epochs
        'prefetch_factor': 2,         # Prefetch batches
        'cuda_sync': False,           # Async CUDA operations
    }
}

# Create directories
os.makedirs(project_dir, exist_ok=True)

# Save configuration
config_path = os.path.join(workspace_dir, 'training_config_dgx.yaml')
with open(config_path, 'w') as f:
    yaml.dump(training_config, f, default_flow_style=False, sort_keys=False)

print(f"\n💾 Configuration saved to: {config_path}")
print(f"\n📊 Training Summary:")
print(f"   Run name: {run_name}")
print(f"   Epochs: {training_config['training']['epochs']}")
print(f"   Learning rate: {training_config['training']['lr0']} → "
      f"{training_config['training']['lr0'] * training_config['training']['lrf']}")
print(f"   Optimizer: {training_config['training']['optimizer']}")
print(f"   Warmup: {training_config['training']['warmup_epochs']} epochs")

print("\n🚀 DGX Optimizations:")
print(f"   ✓ Multi-GPU support ({gpus_to_use} GPUs)")
print(f"   ✓ Mixed precision training (FP16)")
print(f"   ✓ cuDNN benchmark mode")
print(f"   ✓ Persistent workers")
print(f"   ✓ Pin memory")
print(f"   ✓ Prefetch factor: 2")

print("\n⚠️  Spark-Friendly Settings:")
print(f"   ✓ Limited to {gpus_to_use} GPUs (max 4)")
print(f"   ✓ Conservative batch size per GPU")
print(f"   ✓ Cache disabled (disk-based)")
print(f"   ✓ Worker count balanced with CPU cores")

print("\n" + "="*70)
print("✅ CONFIGURATION COMPLETE")
print("="*70)

# Store for next cells
globals()['training_config'] = training_config
globals()['config_path'] = config_path
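
The configuration pairs `cos_lr: True` with `lr0 = 0.001` and `lrf = 0.01`, so the learning rate decays along a cosine curve from 0.001 to 0.001 × 0.01 = 0.00001. The sketch below shows the shape of such a schedule. It ignores the warmup phase and is an illustration of a standard cosine decay, not a copy of the trainer's internal scheduler; `cosine_lr` is a hypothetical name.

```python
import math


def cosine_lr(epoch, epochs, lr0, lrf):
    """Learning rate at `epoch` for a cosine decay from lr0 down to lr0 * lrf."""
    progress = min(max(epoch / epochs, 0.0), 1.0)
    factor = lrf + (1.0 - lrf) * (1.0 + math.cos(math.pi * progress)) / 2.0
    return lr0 * factor


# Start, midpoint, and end of a 150-epoch run with the values from the config
for epoch in (0, 75, 150):
    print(epoch, f"{cosine_lr(epoch, epochs=150, lr0=0.001, lrf=0.01):.6f}")
```

At the midpoint the multiplier is halfway between 1 and `lrf`, which is why the rate is roughly half of `lr0` after 75 epochs.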


In [None]:
# --- STEP 4: DGX SPARK OPTIMIZED TRAINING EXECUTION ---
import os
import yaml
import torch
from ultralytics import YOLO
import gc

# Load configuration
workspace_dir = os.path.expanduser('~/hvac_workspace')
config_path = os.path.join(workspace_dir, 'training_config_dgx.yaml')

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

PROJECT_DIR = config['paths']['project_dir']
RUN_NAME = config['paths']['run_name']
DATA_YAML = config['paths']['data_yaml']
MODEL_ARCH = config['model']['architecture']

print("="*70)
print("🚀 STARTING DGX SPARK OPTIMIZED TRAINING")
print("="*70)
print(f"Project: {PROJECT_DIR}")
print(f"Run: {RUN_NAME}")
print(f"Data: {DATA_YAML}")
print(f"Model: {MODEL_ARCH}")
print("="*70)

# Display GPU status
if torch.cuda.is_available():
    gpu_count = torch.cuda.device_count()
    print(f"\n🖥️ GPU Status:")
    for i in range(gpu_count):
        props = torch.cuda.get_device_properties(i)
        mem_allocated = torch.cuda.memory_allocated(i) / 1e9
        mem_reserved = torch.cuda.memory_reserved(i) / 1e9
        mem_total = props.total_memory / 1e9
        print(f"   GPU {i} ({props.name}):")
        print(f"      Total: {mem_total:.2f} GB")
        print(f"      Allocated: {mem_allocated:.2f} GB")
        print(f"      Reserved: {mem_reserved:.2f} GB")
        print(f"      Free: {mem_total - mem_reserved:.2f} GB")
else:
    print("\n⚠️ WARNING: No GPU detected!")

# Clear any cached memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    gc.collect()
    print("\n🧹 Cleared GPU cache")

# Enable cuDNN optimizations
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
print("✅ cuDNN benchmark mode enabled")

# Smart Resume Logic
last_ckpt = os.path.join(PROJECT_DIR, RUN_NAME, "weights", "last.pt")
resume_training = os.path.exists(last_ckpt)

if resume_training:
    print(f"\n🔄 RESUMING training from checkpoint: {last_ckpt}")
    print("   Loading model and optimizer state...")
    model = YOLO(last_ckpt)
    
    # Resume training
    print("\n🏃 Resuming training...\n")
    results = model.train(resume=True)
    
else:
    print(f"\n🆕 STARTING NEW training run with {MODEL_ARCH}")
    print("   Initializing model...")
    model = YOLO(MODEL_ARCH)

    # Prepare training arguments from config
    device_list = config['hardware']['device']
    if isinstance(device_list, list) and len(device_list) > 1:
        # Multi-GPU training
        device_str = ','.join(map(str, device_list))
        print(f"\n🖥️ Multi-GPU training enabled: GPUs {device_str}")
    else:
        device_str = str(device_list[0]) if isinstance(device_list, list) else str(device_list)
        print(f"\n🖥️ Single-GPU training: GPU {device_str}")
    
    train_args = {
        # Paths
        'data': DATA_YAML,
        'project': PROJECT_DIR,
        'name': RUN_NAME,
        
        # Hardware
        'imgsz': config['hardware']['imgsz'],
        'batch': config['hardware']['batch'],
        'workers': config['hardware']['workers'],
        'cache': config['hardware']['cache'],
        'amp': config['hardware']['amp'],
        'device': device_list,
        
        # Training
        'epochs': config['training']['epochs'],
        'patience': config['training']['patience'],
        'save_period': config['training']['save_period'],
        'close_mosaic': config['training']['close_mosaic'],
        'optimizer': config['training']['optimizer'],
        'lr0': config['training']['lr0'],
        'lrf': config['training']['lrf'],
        'momentum': config['training']['momentum'],
        'weight_decay': config['training']['weight_decay'],
        'warmup_epochs': config['training']['warmup_epochs'],
        'warmup_momentum': config['training']['warmup_momentum'],
        'warmup_bias_lr': config['training']['warmup_bias_lr'],
        'cos_lr': config['training'].get('cos_lr', True),
        
        # Augmentation
        'augment': config['augmentation']['augment'],
        'mosaic': config['augmentation']['mosaic'],
        'mixup': config['augmentation']['mixup'],
        'copy_paste': config['augmentation']['copy_paste'],
        'degrees': config['augmentation']['degrees'],
        'translate': config['augmentation']['translate'],
        'scale': config['augmentation']['scale'],
        'shear': config['augmentation']['shear'],
        'perspective': config['augmentation']['perspective'],
        'fliplr': config['augmentation']['fliplr'],
        'flipud': config['augmentation']['flipud'],
        'hsv_h': config['augmentation']['hsv_h'],
        'hsv_s': config['augmentation']['hsv_s'],
        'hsv_v': config['augmentation']['hsv_v'],
        
        # Loss weights
        'box': config['loss_weights']['box'],
        'cls': config['loss_weights']['cls'],
        'dfl': config['loss_weights']['dfl'],
        'seg': config['loss_weights'].get('seg', 1.0),
        
        # Validation
        'val': config['validation']['val'],
        'plots': config['validation']['plots'],
        'save_json': config['validation']['save_json'],
        'save_hybrid': config['validation']['save_hybrid'],
        'conf': config['validation']['conf'],
        'iou': config['validation']['iou'],
        'max_det': config['validation']['max_det'],
        
        # Logging
        'verbose': config['logging']['verbose'],
        'exist_ok': config['logging']['exist_ok'],
    }
    
    print("\n🎯 Training Configuration:")
    print(f"   Epochs: {train_args['epochs']} (patience: {train_args['patience']})")
    print(f"   Image Size: {train_args['imgsz']}")
    print(f"   Batch Size: {train_args['batch']} (total across GPUs)")
    print(f"   Workers: {train_args['workers']}")
    print(f"   Learning Rate: {train_args['lr0']} → {train_args['lr0'] * train_args['lrf']}")
    print(f"   Optimizer: {train_args['optimizer']} (cosine schedule: {train_args['cos_lr']})")
    print(f"   Mixed Precision: {train_args['amp']}")
    print(f"   Augmentations: Mosaic={train_args['mosaic']}, CopyPaste={train_args['copy_paste']}")
    
    print("\n⚡ DGX Performance Features:")
    print(f"   ✓ Multi-GPU: {len(device_list) if isinstance(device_list, list) else 1} GPUs")
    print(f"   ✓ Mixed Precision (FP16): 2-3x speedup")
    print(f"   ✓ cuDNN Benchmark: Auto-optimized kernels")
    print(f"   ✓ High-speed NVMe data loading")
    
    print("\n🏃 Starting training...")
    print("   Monitor progress in TensorBoard (next cell)")
    print("   Training may take several hours depending on dataset size")
    print("\n" + "="*70 + "\n")
    
    # Start training
    try:
        results = model.train(**train_args)
        
        print("\n" + "="*70)
        print("✅ TRAINING COMPLETE")
        print("="*70)
        print(f"Best model: {os.path.join(PROJECT_DIR, RUN_NAME, 'weights', 'best.pt')}")
        print(f"Last checkpoint: {os.path.join(PROJECT_DIR, RUN_NAME, 'weights', 'last.pt')}")
        
        # Display final metrics
        if hasattr(results, 'results_dict'):
            metrics = results.results_dict
            print("\n📊 Final Metrics:")
            if 'metrics/mAP50(B)' in metrics:
                print(f"   Box mAP50: {metrics['metrics/mAP50(B)']:.4f}")
            if 'metrics/mAP50-95(B)' in metrics:
                print(f"   Box mAP50-95: {metrics['metrics/mAP50-95(B)']:.4f}")
            if 'metrics/mAP50(M)' in metrics:
                print(f"   Mask mAP50: {metrics['metrics/mAP50(M)']:.4f}")
            if 'metrics/mAP50-95(M)' in metrics:
                print(f"   Mask mAP50-95: {metrics['metrics/mAP50-95(M)']:.4f}")
        
    except Exception as e:
        print(f"\n❌ ERROR during training: {e}")
        print("\n💡 Troubleshooting tips:")
        print("   - Check GPU memory: reduce batch size if OOM")
        print("   - Verify data.yaml path is correct")
        print("   - Ensure dataset has train/val splits")
        print("   - Check TensorBoard logs for details")
        raise
    
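    finally:
        # Clean up: collect unreachable Python objects first so their CUDA
        # tensors are freed, then return cached blocks to the driver
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            print("\n🧹 GPU memory cleared")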

print("\n" + "="*70)
print("🎉 TRAINING SESSION COMPLETE")
print("="*70)


In [None]:
# --- STEP 5: TENSORBOARD MONITORING ---
import os
import yaml
from IPython.display import display, HTML

# Load configuration
workspace_dir = os.path.expanduser('~/hvac_workspace')
config_path = os.path.join(workspace_dir, 'training_config_dgx.yaml')

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

runs_dir = config['paths']['project_dir']
run_name = config['paths']['run_name']
tensorboard_dir = os.path.join(runs_dir, run_name)

print("="*70)
print("📊 TENSORBOARD MONITORING")
print("="*70)
print(f"\nLog directory: {tensorboard_dir}")
print("\n🚀 Launch TensorBoard with:")
print(f"\n   tensorboard --logdir {tensorboard_dir} --port 6006")
print("\n📊 Metrics to monitor:")
print("   - Loss curves (train/val)")
print("   - mAP50 and mAP50-95")
print("   - Precision and Recall")
print("   - Learning rate schedule")
print("   - GPU utilization")
print("\n💡 Access TensorBoard:")
print("   1. Open a terminal on DGX")
print(f"   2. Run: tensorboard --logdir {tensorboard_dir} --port 6006")
print("   3. Navigate to: http://localhost:6006")
print("   4. Or tunnel via SSH: ssh -L 6006:localhost:6006 dgx-server")

# Try to load TensorBoard in notebook if possible
try:
    %load_ext tensorboard
    %tensorboard --logdir {tensorboard_dir} --port 6006
except Exception:
    print("\n⚠️ TensorBoard extension not available in this environment")
    print("   Use the command-line instructions above instead.")

print("\n" + "="*70)


In [None]:
# --- STEP 6: MODEL EVALUATION & COMPARISON ---
import os
import yaml
import json
import pandas as pd
from ultralytics import YOLO
import matplotlib.pyplot as plt
import seaborn as sns

# Load configuration
workspace_dir = os.path.expanduser('~/hvac_workspace')
config_path = os.path.join(workspace_dir, 'training_config_dgx.yaml')

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

PROJECT_DIR = config['paths']['project_dir']
RUN_NAME = config['paths']['run_name']
best_model_path = os.path.join(PROJECT_DIR, RUN_NAME, 'weights', 'best.pt')

print("="*70)
print("📊 MODEL EVALUATION")
print("="*70)

if os.path.exists(best_model_path):
    print(f"Loading best model from: {best_model_path}")
    model = YOLO(best_model_path)
    
    # Run validation
    print("\n🔍 Running validation on test set...")
    results = model.val(
        data=config['paths']['data_yaml'],
        split='test',
        imgsz=config['hardware']['imgsz'],
        batch=config['hardware']['batch'],
        conf=0.25,
        iou=0.45,
        plots=True,
        save_json=True
    )
    
    # Display results
    print("\n📈 Validation Results:")
    print(f"   mAP50: {results.box.map50:.4f}")
    print(f"   mAP50-95: {results.box.map:.4f}")
    print(f"   Precision: {results.box.mp:.4f}")
    print(f"   Recall: {results.box.mr:.4f}")
    
    if hasattr(results, 'seg'):
        print(f"\n   Mask mAP50: {results.seg.map50:.4f}")
        print(f"   Mask mAP50-95: {results.seg.map:.4f}")
    
    # Per-class results
    print("\n📊 Per-Class Performance:")
    class_results = []
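    # ap_class_index lists only the classes present in the evaluated split;
    # ap50 / ap hold the matching per-class AP@0.5 and AP@0.5:0.95 values
    for cls_idx, ap50, ap in zip(results.box.ap_class_index, results.box.ap50, results.box.ap):
        class_results.append({
            'Class': model.names[int(cls_idx)],
            'mAP50': float(ap50),
            'mAP50-95': float(ap)
        })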
    
    df = pd.DataFrame(class_results)
    df = df.sort_values('mAP50-95', ascending=False)
    print(df.to_string(index=False))
    
    # Save results
    results_path = os.path.join(PROJECT_DIR, RUN_NAME, 'evaluation_results.json')
    with open(results_path, 'w') as f:
        json.dump({
            'overall': {
                'mAP50': float(results.box.map50),
                'mAP50-95': float(results.box.map),
                'precision': float(results.box.mp),
                'recall': float(results.box.mr)
            },
            'per_class': class_results
        }, f, indent=2)
    print(f"\n💾 Results saved to: {results_path}")
    
    # Visualize per-class performance
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
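    # mAP50 bar plot (re-sorted so the panel matches its title)
    top_map50 = df.sort_values('mAP50', ascending=False).head(10)
    axes[0].barh(top_map50['Class'], top_map50['mAP50'])
    axes[0].invert_yaxis()  # best class at the top
    axes[0].set_xlabel('mAP50')
    axes[0].set_title('Top 10 Classes by mAP50')
    axes[0].set_xlim([0, 1])
    
    # mAP50-95 bar plot (df is already sorted by mAP50-95)
    top_map = df.head(10)
    axes[1].barh(top_map['Class'], top_map['mAP50-95'])
    axes[1].invert_yaxis()  # best class at the top
    axes[1].set_xlabel('mAP50-95')
    axes[1].set_title('Top 10 Classes by mAP50-95')
    axes[1].set_xlim([0, 1])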
    
    plt.tight_layout()
    plt.savefig(os.path.join(PROJECT_DIR, RUN_NAME, 'class_performance.png'))
    plt.show()
    
else:
    print(f"❌ Model not found: {best_model_path}")
    print("   Please ensure training completed successfully.")

print("\n" + "="*70)
print("✅ EVALUATION COMPLETE")
print("="*70)

In [None]:
# --- STEP 7: EXPORT MODEL FOR DEPLOYMENT ---
import os
import yaml
from ultralytics import YOLO

# Load configuration
workspace_dir = os.path.expanduser('~/hvac_workspace')
config_path = os.path.join(workspace_dir, 'training_config_dgx.yaml')

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

PROJECT_DIR = config['paths']['project_dir']
RUN_NAME = config['paths']['run_name']
best_model_path = os.path.join(PROJECT_DIR, RUN_NAME, 'weights', 'best.pt')

print("="*70)
print("📦 MODEL EXPORT")
print("="*70)

if os.path.exists(best_model_path):
    model = YOLO(best_model_path)
    
    # Export to ONNX for production deployment
    print("\n🔄 Exporting to ONNX format...")
    # Note: `optimize` targets TorchScript mobile export, so it is omitted here
    onnx_path = model.export(
        format='onnx',
        imgsz=config['hardware']['imgsz'],
        simplify=True
    )
    print(f"✅ ONNX model saved to: {onnx_path}")
    
    # Optional: Export to TorchScript
    print("\n🔄 Exporting to TorchScript format...")
    torchscript_path = model.export(
        format='torchscript',
        imgsz=config['hardware']['imgsz']
    )
    print(f"✅ TorchScript model saved to: {torchscript_path}")
    
    # Model info
    print("\n📊 Model Information:")
    print(f"   Architecture: {config['model']['architecture']}")
    print(f"   Input Size: {config['hardware']['imgsz']}")
    print(f"   Classes: {len(model.names)}")
    print(f"   Parameters: {sum(p.numel() for p in model.model.parameters())/1e6:.2f}M")
    
    print("\n🚀 Deployment Instructions:")
    print("   1. Copy the exported model to your deployment environment")
    print("   2. Use ONNX Runtime for efficient inference")
    print("   3. Recommended confidence threshold: 0.25")
    print("   4. Recommended IoU threshold: 0.45")
    
else:
    print(f"❌ Model not found: {best_model_path}")

print("\n" + "="*70)
print("✅ EXPORT COMPLETE")
print("="*70)

## 🎉 DGX Spark Training Complete!

### ✅ What Was Accomplished
- ✓ Trained YOLO11 model on NVIDIA DGX infrastructure
- ✓ Utilized multi-GPU acceleration (if available)
- ✓ Optimized for Spark resource management
- ✓ Generated production-ready model exports

### 📊 Next Steps

#### 1. Review Results
- Check TensorBoard logs: `tensorboard --logdir ~/hvac_workspace/runs/segment/[run_name]`
- Review evaluation metrics in the cells above
- Examine per-class performance for problem areas
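
To look for problem classes programmatically, a minimal sketch along these lines could read the `evaluation_results.json` file written by the evaluation cell. The helper names `load_results` and `weakest_classes` and the 0.5 threshold are illustrative assumptions, not part of the notebook.

```python
import json


def load_results(path):
    """Load the JSON file written by the evaluation cell."""
    with open(path, "r") as f:
        return json.load(f)


def weakest_classes(results, metric="mAP50-95", threshold=0.5):
    """Return (class, score) pairs scoring below `threshold`, worst first.

    `results` follows the layout saved by the evaluation cell:
    {"overall": {...}, "per_class": [{"Class": ..., "mAP50": ..., "mAP50-95": ...}]}
    """
    rows = [(row["Class"], row[metric]) for row in results.get("per_class", [])]
    below = [row for row in rows if row[1] < threshold]
    return sorted(below, key=lambda row: row[1])
```

Calling `weakest_classes(load_results(results_path))` would then give a short list of classes to prioritize when collecting data or tuning augmentation.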

#### 2. Test Inference
```python
from ultralytics import YOLO
import os

# Load best model
model_path = os.path.expanduser('~/hvac_workspace/runs/segment/[run_name]/weights/best.pt')
model = YOLO(model_path)

# Run inference
results = model.predict(
    source='path/to/test/images',
    save=True,
    conf=0.25,
    iou=0.45
)
```

#### 3. Optimize Further
If results need improvement:
- **Low mAP**: Increase epochs, collect more data, or adjust augmentation
- **Overfitting**: Increase augmentation strength, add more regularization
- **Class imbalance**: Use copy_paste augmentation, weighted sampling
- **Small objects**: Increase image size to 1280 or 1536

#### 4. Deploy to Production

**Option A: ONNX Deployment (Recommended)**
```bash
# Model is already exported to ONNX format
# Use ONNX Runtime for inference
pip install onnxruntime-gpu
```
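
With ONNX Runtime, confidence filtering and non-maximum suppression usually fall to the caller, assuming the model was exported without embedded NMS. The sketch below illustrates how the recommended thresholds (0.25 confidence, 0.45 IoU) would be applied. It assumes detections have already been decoded into `(box, score, class_id)` tuples with corner-format boxes; decoding the raw model output is not shown, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_thres=0.25, iou_thres=0.45):
    """Greedy per-class NMS over (box, score, class_id) tuples."""
    # Drop low-confidence detections, then visit the rest from best to worst
    candidates = sorted(
        (d for d in detections if d[1] >= conf_thres),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box unless it overlaps an already kept box of the same class
        if all(det[2] != k[2] or iou(det[0], k[0]) <= iou_thres for k in kept):
            kept.append(det)
    return kept
```

This pure-Python version is quadratic in the number of detections and is meant for illustration only; a production pipeline would more likely use a vectorized implementation.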

**Option B: TensorRT (Maximum Performance)**
```python
# Export to TensorRT on DGX
model.export(format='engine', device=0, half=True, imgsz=1024)
```

**Option C: Native PyTorch**
```python
# Use best.pt directly with Ultralytics
model = YOLO('best.pt')
```

### 🔧 DGX Spark Best Practices

#### Resource Management
- **Monitor GPU usage**: `nvidia-smi -l 1` to watch in real-time
- **Check Spark resources**: Ensure you're not overloading the cluster
- **Batch size tuning**: Adjust based on GPU memory and throughput
- **Worker count**: Balance between CPU cores and I/O bandwidth

#### Multi-GPU Training
For even faster training, you can use more GPUs:
```python
# Edit training_config_dgx.yaml
# hardware:
#   device: [0, 1, 2, 3]  # Use 4 GPUs
#   batch: 48  # 12 per GPU
```
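
Before editing the config, it can help to check how a total batch size and the CPU cores split across the selected GPUs. The sketch below follows the notebook's rule of thumb (workers = CPU count / GPU count). The `plan_resources` helper is hypothetical, and the divisibility check is a precaution on the assumption that an unevenly divisible batch is rejected or split unevenly.

```python
import os


def plan_resources(total_batch, devices, cpu_count=None):
    """Split a total batch size and the CPU cores across the selected GPUs.

    `devices` may be a list such as [0, 1, 2, 3] or a single device id.
    """
    n_gpus = len(devices) if isinstance(devices, (list, tuple)) else 1
    if n_gpus < 1:
        raise ValueError("at least one device is required")
    if total_batch % n_gpus != 0:
        raise ValueError(
            f"batch={total_batch} is not divisible by {n_gpus} GPUs"
        )
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return {
        "gpus": n_gpus,
        "batch_per_gpu": total_batch // n_gpus,
        "workers_per_gpu": max(1, cpus // n_gpus),
    }
```

For the example above, `plan_resources(48, [0, 1, 2, 3])` yields 12 images per GPU, matching the comment in the config snippet.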

#### Storage Considerations
- Use NVMe storage for datasets (faster I/O)
- Keep models and checkpoints on high-speed storage
- Archive old runs to slower storage
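
Archiving can be scripted. The following sketch moves run directories that have not been modified for a given number of days into an archive location. The function name, the 30-day default, and the use of directory modification time as the age signal are all assumptions; a directory's mtime only changes when its direct entries change, so adjust the criterion to your workflow.

```python
import os
import shutil
import time


def archive_old_runs(runs_dir, archive_dir, max_age_days=30, now=None):
    """Move run directories older than `max_age_days` into `archive_dir`.

    Returns the sorted names of the directories that were moved.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    # Materialize the listing first so moving entries does not disturb iteration
    for entry in list(os.scandir(runs_dir)):
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.move(entry.path, os.path.join(archive_dir, entry.name))
            moved.append(entry.name)
    return sorted(moved)
```

Keep `archive_dir` outside `runs_dir`, otherwise the archive itself would eventually be treated as an old run.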

### 🐛 Troubleshooting

#### Out of Memory (OOM)
```yaml
# Reduce batch size
batch: 8  # instead of 12

# Or reduce image size
imgsz: 640  # instead of 1024
```
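
Batch size reduction can also be automated. The sketch below retries a training callable with half the batch size after each out-of-memory error. It assumes the error surfaces as a `RuntimeError` whose message contains "out of memory", which is how PyTorch typically reports CUDA OOM in single-process runs; multi-GPU launches may report failures differently. The `train_with_backoff` name is hypothetical.

```python
def train_with_backoff(train_fn, batch, min_batch=1):
    """Call `train_fn(batch)`, halving `batch` after each out-of-memory error.

    Returns (batch_used, result). Errors that are not OOM-related, or OOM
    errors once the batch cannot be halved further, are re-raised.
    """
    while True:
        try:
            return batch, train_fn(batch)
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            if not is_oom or batch // 2 < min_batch:
                raise
            batch //= 2  # retry with half the batch size
```

In this notebook it could be used as `train_with_backoff(lambda b: model.train(**{**train_args, 'batch': b}), train_args['batch'])`, ideally clearing the CUDA cache between attempts.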

#### Slow Training
- Enable AMP (mixed precision): `amp: True`
- Increase workers: `workers: 16`
- Use `cache: True` if you have enough RAM
- Enable cuDNN benchmark: Already enabled in this notebook

#### Spark Resource Conflicts
- Limit GPU usage: `device: [0, 1]` instead of all GPUs
- Reduce worker count to free CPU cores
- Check with your cluster admin for resource allocation

### 📚 Additional Resources
- [Ultralytics YOLO Documentation](https://docs.ultralytics.com/)
- [NVIDIA DGX Best Practices](https://docs.nvidia.com/dgx/)
- [TensorRT Optimization Guide](https://docs.nvidia.com/deeplearning/tensorrt/)
- Project repository: Check `ai_model/OPTIMIZATION_GUIDE.md`

### 🔒 Security Notes
- Store API keys in environment variables, not in code
- Use `.gitignore` to exclude model weights from version control
- Keep dataset access controls in place
- Review model outputs before deploying to production
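
As an illustration of the first point, a key can be read from the environment and the notebook can fail early when it is missing. The variable name `ROBOFLOW_API_KEY` and the helper `get_api_key` are assumptions made for this sketch; use whatever name your environment defines.

```python
import os


def get_api_key(var_name="ROBOFLOW_API_KEY"):
    """Read an API key from the environment, failing early if it is unset."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this notebook."
        )
    return key
```

The key then never appears in the notebook source or its saved outputs, provided it is not printed.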

---

**Happy Training! 🚀**
