%markdown
# Ultralytics YOLO Training on Databricks Serverless GPU Compute (SGC)

This notebook demonstrates how to set up and run distributed YOLO model training using Databricks Serverless GPU Compute with proper resource management and MLflow integration.

## 🚀 Overview

This implementation leverages:
- **Databricks Serverless GPU Compute (SGC)** for scalable, on-demand GPU resources
- **Ultralytics YOLO11** for state-of-the-art object detection
- **Distributed Training** with PyTorch DDP and NCCL backend
- **MLflow** for experiment tracking and model management
- **Unity Catalog Volumes** for persistent data storage

## 📋 Prerequisites

### Environment Requirements
- Databricks Runtime with GPU support
- Access to Serverless GPU Compute
- Unity Catalog enabled workspace
- MLflow experiment tracking permissions

### Required Packages
```text
ultralytics==8.3.204
mlflow>=3.0
nvidia-ml-py==13.580.82  # GPU monitoring
pyrsmi==0.2.0           # AMD GPU support (optional)
threadpoolctl==3.1.0
```

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Databricks SGC Cluster                  │
├─────────────────────────────────────────────────────────────┤
│  GPU 0    │  GPU 1    │  GPU 2    │  ...    │  GPU N-1    │
│  Rank 0   │  Rank 1   │  Rank 2   │  ...    │  Rank N-1   │
│  (Master) │           │           │         │             │
└─────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────┐
│                    Unity Catalog Volume                     │
│  /Volumes/catalog/schema/volume/                           │
│  ├── data/           # Training datasets                   │
│  ├── raw_model/      # Pre-trained models                 │
│  └── training_runs/  # Output artifacts                   │
└─────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────┐
│                      MLflow Tracking                       │
│  • Experiment logging                                      │
│  • Model versioning                                        │
│  • Metrics & artifacts                                     │
└─────────────────────────────────────────────────────────────┘
```

## 🔧 Setup Instructions

### 1. Environment Setup

First, attach the notebook to A10 Serverless GPU Compute with environment version 4. Then install the required packages and restart the Python runtime:

```python
%pip install -U "mlflow>=3.0"
%pip install ultralytics==8.3.204
%pip install nvidia-ml-py==13.580.82
dbutils.library.restartPython()
```

### 2. Unity Catalog Configuration

Create necessary catalog structure:

```sql
CREATE CATALOG IF NOT EXISTS your_catalog;
CREATE SCHEMA IF NOT EXISTS your_catalog.computer_vision;
CREATE VOLUME IF NOT EXISTS your_catalog.computer_vision.yolo;
```

### 3. Data Preparation

**Supported Dataset Formats:**
- COCO format (recommended)
- YOLO format
- Custom annotations

**Directory Structure:**
```
/Volumes/catalog/schema/volume/
├── data/
│   └── dataset_name/
│       ├── images/
│       │   ├── train/
│       │   └── val/
│       ├── labels/
│       │   ├── train/
│       │   └── val/
│       └── data.yaml
├── raw_model/
│   └── yolo11n.pt
└── training_runs/
```
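
Before training, it can help to confirm that a dataset folder matches the layout above. The following is a minimal sketch using only the standard library; the helper name `missing_dataset_entries` and the example root path are illustrative and not part of this notebook.

```python
from pathlib import Path

# Expected entries relative to the dataset root, mirroring the tree above.
EXPECTED_ENTRIES = (
    "images/train",
    "images/val",
    "labels/train",
    "labels/val",
    "data.yaml",
)


def missing_dataset_entries(dataset_root):
    """Return the expected entries that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


# Hypothetical dataset location; replace with your own volume path.
missing = missing_dataset_entries("/Volumes/catalog/schema/volume/data/dataset_name")
print("missing entries:", missing or "none")
```

An empty result means every expected folder and the `data.yaml` file are present; it does not check that images and labels actually pair up.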

### 4. MLflow Configuration

Set up experiment tracking:

```python
import os

import mlflow

experiment_name = "/Users/your.email@company.com/YOLO_Experiments"
mlflow.set_experiment(experiment_name)
os.environ['MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING'] = "true"
```

## 🚀 Distributed Training

### Key Features

1. **Automatic GPU Detection**: Dynamically scales across available GPUs
2. **Fault Tolerance**: Proper cleanup with try/finally blocks
3. **Resource Management**: Prevents NCCL process group leaks
4. **MLflow Integration**: Automatic logging of metrics, models, and artifacts

### Training Configuration

```python
@distributed(gpus=8, gpu_type='A10', remote=True)
def train_fn(world_size=None, parent_run_id=None):
    try:
        # Setup distributed environment
        rank, world_size, device = setup()
        
        # Initialize YOLO model
        model = YOLO("yolo11n.pt")
        
        # Start training with optimized parameters
        model.train(
            task="detect",
            batch=16,                    # Adjust based on GPU memory
            device=[LOCAL_RANK],
            data=data_yaml_path,
            epochs=100,
            project=training_output_path,
            exist_ok=True,
            # Data augmentation
            fliplr=1,                   # Horizontal flip probability (1 = always flip)
            flipud=1,                   # Vertical flip probability (1 = always flip)
            perspective=0.001,          # Perspective transformation strength
            degrees=0.45                # Maximum random rotation, in degrees
        )
        
        # Validation and export (rank 0 only)
        if RANK in (0, -1):
            model.val()
            model.export()
            
    finally:
        # Critical: Always cleanup to prevent resource leaks
        cleanup()
```

## 📊 Monitoring & Debugging

### Environment Variables

```python
# Debugging options
os.environ["CUDA_LAUNCH_BLOCKING"] = "0"  # Set to "1" for debugging
os.environ["NCCL_DEBUG"] = "INFO"         # NCCL communication logs
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"   # Detailed NCCL debugging
```

### MLflow System Metrics

Enable comprehensive system monitoring:

```python
os.environ['MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING'] = "true"
```

This captures:
- GPU utilization and memory
- CPU usage and memory
- Network I/O
- Disk I/O

## 🔍 Troubleshooting

### Common Issues

1. **NCCL Process Group Warning**
   ```
   ProcessGroupNCCL.cpp:1479] Warning: destroy_process_group() was not called
   ```
   **Solution**: Ensure `cleanup()` is called in a `finally` block

2. **CUDA Out of Memory**
   ```
   RuntimeError: CUDA out of memory
   ```
   **Solution**: Reduce batch size or use gradient accumulation

3. **Distributed Training Hangs**
   **Solution**: Check NCCL environment variables and network connectivity
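
To make the gradient accumulation option for out-of-memory errors concrete, here is a small arithmetic sketch. It assumes Ultralytics derives the number of accumulation steps from a nominal batch size (`nbs`, default 64) roughly as `max(round(nbs / batch), 1)`; treat that formula as an assumption and check it against your installed version. The function name is illustrative.

```python
def accumulation_steps(batch_size, nominal_batch_size=64):
    """Number of batches accumulated before one optimizer step (assumed formula)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return max(round(nominal_batch_size / batch_size), 1)


for batch in (64, 16, 8):
    steps = accumulation_steps(batch)
    print(f"batch={batch}: accumulate {steps} step(s), effective batch ~{batch * steps}")
```

Under this assumption, halving `batch` to fit in memory doubles the accumulation steps, so the effective batch per optimizer step stays near the nominal value.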

### Debug Commands

```python
# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")

# Verify distributed setup
print(f"Rank: {RANK}, Local Rank: {LOCAL_RANK}")
print(f"World Size: {dist.get_world_size() if dist.is_initialized() else 'Not initialized'}")
```

## 📚 References

- [Ultralytics YOLO Documentation](https://docs.ultralytics.com/)
- [Databricks Serverless GPU Compute](https://docs.databricks.com/en/compute/serverless-gpu.html)
- [PyTorch Distributed Training](https://pytorch.org/docs/stable/distributed.html)
- [MLflow Model Registry](https://mlflow.org/docs/latest/model-registry.html)
- [Unity Catalog Volumes](https://docs.databricks.com/en/catalog/volumes.html)

## 🏷️ Model Versioning

This setup automatically:
- Logs training metrics to MLflow
- Saves model artifacts to Unity Catalog
- Creates model signatures for deployment
- Tracks hyperparameters and dataset versions

## 🚦 Best Practices

1. **Resource Management**: Always use try/finally for cleanup
2. **Data Location**: Use Unity Catalog Volumes for overall governance
3. **Batch Size**: Start with smaller batches and scale up
4. **Monitoring**: Enable MLflow system metrics
5. **Checkpointing**: For long training runs, write intermediate results to local `/tmp/`, then copy anything you need to keep to a Unity Catalog Volume, since `/tmp/` is ephemeral
6. **Validation**: Run validation on rank 0 only to avoid conflicts

---

**Next Steps**: Run the cells below to start your YOLO training pipeline! 🎯

In [0]:
import serverless_gpu
%pip install -U "mlflow>=3.0"
%pip install threadpoolctl==3.1.0
%pip install ultralytics==8.3.204
%pip install nvidia-ml-py==13.580.82 # for later mlflow GPU monitoring
%pip install pyrsmi==0.2.0 # for later mlflow AMD GPU monitoring if you have AMD


dbutils.library.restartPython()

In [0]:
import os

import pandas as pd
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import ColSpec, Schema
from serverless_gpu import distributed
from ultralytics import YOLO, settings
from ultralytics.utils import LOCAL_RANK, RANK

In [0]:
%cat /tmp/Ultralytics/settings.json

# Setup I/O Locations

In [0]:
%sql
create catalog if not exists yyang;
create schema if not exists yyang.computer_vision;
create volume if not exists yyang.computer_vision.yolo;

In [0]:
project_location = '/Volumes/yyang/computer_vision/yolo/'
os.makedirs(f'{project_location}/training_runs/', exist_ok=True)
os.makedirs(f'{project_location}/data/', exist_ok=True)
os.makedirs(f'{project_location}/raw_model/', exist_ok=True)

# volume folder in UC.
volume_project_location = f'{project_location}/training_results/'
os.makedirs(volume_project_location, exist_ok=True)

# or alternatively, ephemeral /tmp/ project location on VM
tmp_project_location = "/tmp/training_results/"
os.makedirs(tmp_project_location, exist_ok=True)

# Image Data I/O

In [0]:
os.chdir('/Workspace/' + dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get().rsplit('/', 1)[0])
os.getcwd()

In [0]:
%sh
# curl -L https://github.com/ultralytics/ultralytics/raw/main/ultralytics/cfg/datasets/coco8.yaml -o coco8.yaml
curl -L https://github.com/ultralytics/ultralytics/raw/main/ultralytics/cfg/datasets/coco128.yaml -o coco128.yaml

In [0]:
%sh
# cat ./coco8.yaml
cat ./coco128.yaml

__REMEMBER: update the path in the cell below to point to your own data.yaml; it is the input to YOLO training later.__

In [0]:
import os

# with open('coco8.yaml', 'r') as file:
with open('coco128.yaml', 'r') as file:

    data = file.read()

# For this dataset, the `path:` entry in the .yaml file must point to the real I/O location on the volume.
os.makedirs(f'{project_location}/data/coco128', exist_ok=True)
data = data.replace('path: coco128', f'path: {project_location}data/coco128')


# with open('coco8.yaml', 'w') as file:
with open('coco128.yaml', 'w') as file:

    file.write(data)
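
The `str.replace` call above only works when the file contains exactly `path: coco128`. As a sketch of a slightly more tolerant alternative, the helper below rewrites whatever value follows a top-level `path:` key and keeps any trailing comment. The function name `set_dataset_root` and the sample text are illustrative, not part of the original notebook.

```python
import re


def set_dataset_root(yaml_text, new_root):
    """Replace the value of the first top-level `path:` key, keeping any trailing comment."""
    pattern = re.compile(r"^(path:[ \t]*)(\S+)(.*)$", flags=re.MULTILINE)
    updated, count = pattern.subn(
        lambda m: f"{m.group(1)}{new_root}{m.group(3)}", yaml_text, count=1
    )
    if count == 0:
        raise ValueError("no top-level 'path:' key found")
    return updated


# Illustrative YAML fragment, not the full dataset file.
sample = "path: coco128 # dataset root dir\ntrain: images/train2017\n"
print(set_dataset_root(sample, "/Volumes/catalog/schema/volume/data/coco128"))
```

Raising an error when no `path:` key is found makes a silent no-op (which `str.replace` would allow) visible before training starts.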

In [0]:
%sh
# cat ./coco8.yaml
cat ./coco128.yaml

In [0]:
import yaml

# with open('./coco8.yaml', 'r') as file:
with open('./coco128.yaml', 'r') as file:
    data = yaml.safe_load(file)

data

In [0]:
# import requests
# import tarfile
# import io

# response = requests.get(data['download'])
# tar = tarfile.open(fileobj=io.BytesIO(response.content), mode='r:gz')
# tar.extractall(path=data['path'])
# tar.close()

In [0]:
import requests, zipfile, io

response = requests.get(data['download'])
z = zipfile.ZipFile(io.BytesIO(response.content))
extraction_path = '/'.join(data['path'].split('/')[:-1])  # Extract into the parent directory so that "/coco128/" does not appear twice in the final path.
print(extraction_path)
z.extractall(extraction_path)

In [0]:
ls

In [0]:
# %sh
# mv ./coco128.yaml /Volumes/yyang/computer_vision/yolo/data/
import shutil
import os

data_yaml_path = f'{extraction_path}/coco128.yaml'
print('data_yaml_path is:', data_yaml_path)

if os.path.exists(data_yaml_path):
    os.remove(data_yaml_path)
shutil.move('./coco128.yaml', data_yaml_path)


# MLflow Setup

In [0]:
from ultralytics import YOLO
import torch
import mlflow
import torch.distributed as dist
from ultralytics import settings
from mlflow.types.schema import Schema, ColSpec
from mlflow.models.signature import ModelSignature

input_schema = Schema(
    [
        ColSpec("string", "image_source"),
    ]
)
output_schema = Schema([ColSpec("string","class_name"),
                        ColSpec("integer","class_num"),
                        ColSpec("double","confidence")]
                       )

signature = ModelSignature(inputs=input_schema, 
                           outputs=output_schema)

# settings.update({"mlflow": False})  # Uncomment to disable the Ultralytics MLflow integration (no MLflow logging or experiment tracking).

# Ultralytics-level MLflow setting: enable the built-in MLflow logging.
settings.update({"mlflow": True})
# Disable MLflow's own autologging for now.
mlflow.autolog(disable=True)
mlflow.end_run()

In [0]:
import os

experiment_name = "/Users/yang.yang@databricks.com/SGC_YOLO_Test/Experiments_YOLO_CoCo"

os.environ['MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING'] = "true"
print(f"MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING set to {os.environ['MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING']}")

os.environ['MLFLOW_EXPERIMENT_NAME'] = experiment_name
print(f"MLFLOW_EXPERIMENT_NAME set to {os.environ['MLFLOW_EXPERIMENT_NAME']}")

In [0]:
mlflow.set_experiment(experiment_name)
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id

# Clear MLFLOW_RUN_ID so that we do not attach to the wrong run.
if 'MLFLOW_RUN_ID' in os.environ:
    del os.environ['MLFLOW_RUN_ID']

with mlflow.start_run(experiment_id=experiment_id) as parent_run:
    active_run_id = mlflow.last_active_run().info.run_id
    active_run_name = mlflow.last_active_run().info.run_name

In [0]:
print(experiment_name, experiment_id, active_run_id)

In [0]:
YOLO("yolo11n")

In [0]:
# data_yaml_path = "./coco128.yaml" # ref: https://github.com/ultralytics/ultralytics/blob/main/ultralytics/cfg/datasets/coco128.yaml

data_yaml_path = '/Volumes/yyang/computer_vision/yolo/data/coco128.yaml'

## Train Using SGC

In [0]:
def setup():
    """Initialize the distributed training process group"""
    # Check if we're in a distributed environment
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        rank = int(os.environ['RANK'])
        world_size = int(os.environ['WORLD_SIZE'])
        local_rank = int(os.environ.get('LOCAL_RANK', 0))
    else:
        # Fallback for single GPU
        rank = 0
        world_size = 1
        local_rank = 0

    # Initialize process group
    if world_size > 1:
        if not dist.is_initialized():
            dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)

    # Set device
    if torch.cuda.is_available():
        device = torch.device(f'cuda:{local_rank}')
        torch.cuda.set_device(device)
    else:
        device = torch.device('cpu')

    return rank, world_size, device
  
def cleanup():
    """Clean up the distributed training process group"""
    if dist.is_initialized():
        dist.destroy_process_group()
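
The rank lookup in `setup()` can be exercised without GPUs by separating the environment parsing from the process-group call. The sketch below mirrors that logic as a pure function; `DistEnv` and `read_dist_env` are illustrative names, and the sample values are made up.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int
    world_size: int
    local_rank: int


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Mirror the rank lookup in setup(): fall back to a single process."""
    if "RANK" in env and "WORLD_SIZE" in env:
        return DistEnv(
            rank=int(env["RANK"]),
            world_size=int(env["WORLD_SIZE"]),
            local_rank=int(env.get("LOCAL_RANK", 0)),
        )
    return DistEnv(rank=0, world_size=1, local_rank=0)


# Illustrative values a launcher might export for one of four workers.
print(read_dist_env({"RANK": "2", "WORLD_SIZE": "4", "LOCAL_RANK": "0"}))
# No distributed variables set: single-process fallback.
print(read_dist_env({}))
```

Passing a plain dictionary instead of `os.environ` makes it easy to check the fallback path before launching a remote job.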

In [0]:
model = YOLO(f"{project_location}/raw_model/yolo11n.pt")
model.train(
    task="detect",
    batch=16,  # Three modes: an integer (e.g., batch=16), auto mode targeting 60% GPU memory utilization (batch=-1), or auto mode with a given utilization fraction (batch=0.70).
    device=-1,  # Single-process run: -1 asks Ultralytics to pick an idle GPU automatically. In the distributed cells below, device must be [LOCAL_RANK] instead.
    data=data_yaml_path,
    epochs=20,
    project=f'{tmp_project_location}',  # ephemeral location on the local VM
    # project=f'{volume_project_location}',  # writing directly to the volume path still does not work
    exist_ok=True,
    fliplr=1,
    flipud=1,
    perspective=0.001,
    degrees=.45
)

In [0]:
settings.update({"mlflow":True}) # if you do want to autolog.
mlflow.autolog(disable = False)

print('data_yaml_path is:', data_yaml_path)

import logging
logging.getLogger("mlflow").setLevel(logging.DEBUG)


@distributed(gpus=4, gpu_type='A10', remote=True)
#: -----------worker func: this function is visible to each GPU device.-------------------
def train_fn(world_size = None, parent_run_id = None):


    # import os
    # from ultralytics import YOLO
    # import torch
    # import mlflow
    # import torch.distributed as dist
    # from ultralytics import settings
    # from mlflow.types.schema import Schema, ColSpec
    # from mlflow.models.signature import ModelSignature
    from ultralytics.utils import RANK, LOCAL_RANK

    # Setup distributed training
    rank, world_size, device = setup()

    print(f"Rank: {rank}, World Size: {world_size}, Device: {device}")
    print(f"Rank: {RANK}, World Size: {world_size}, Device: {LOCAL_RANK}")


    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA device count: {torch.cuda.device_count()}")
        print(f"Current CUDA device: {torch.cuda.current_device()}")


    ############################
    os.environ["CUDA_LAUNCH_BLOCKING"] = "0"  # Set to "1" for synchronous kernel launches, which is easier to debug.
    os.environ["NCCL_DEBUG"] = "INFO" # "WARN" # for more debugging info on the NCCL side.
    os.environ['MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING'] = "true"
    os.environ['MLFLOW_EXPERIMENT_NAME'] = experiment_name
    # We set the experiment details here
    experiment = mlflow.set_experiment(experiment_name)
    
    # # #: from repo issue https://github.com/ultralytics/ultralytics/issues/11680
    # ## conclusion: doesn't work, has error :"ValueError: Invalid CUDA 'device=0,1' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU."
    # # torch.backends.cudnn.benchmark = False
    # # torch.cuda.synchronize()
    # print(f"------Before init_process_group, we have: {RANK=} -- {LOCAL_RANK=}------")
    # dist.init_process_group(
    #     backend="nccl",
    #     init_method="env://",
    #     world_size=world_size,
    #     rank=RANK, # this must be from 0 to world_size - 1. LOCAL_RANK wont work.
    # )
    # print(f"------After init_process_group, we have: {RANK=} -- {LOCAL_RANK=}------")

    print('data_yaml_path is:', data_yaml_path)
    #
    # with mlflow.start_run(run_id=parent_run_id):
    with mlflow.start_run():
        # model = YOLO(f"yolov11n.pt") # shared location
        model = YOLO(f"{project_location}/raw_model/yolo11n.pt")
        model.train(
            task="detect",
            batch=16,  # Three modes: an integer (e.g., batch=16), auto mode targeting 60% GPU memory utilization (batch=-1), or auto mode with a given utilization fraction (batch=0.70).
            device=[LOCAL_RANK],  # Must be LOCAL_RANK (0 in this case) because the process group is already initialized in setup(); RANK will not work. Do not pass [0, 1] even with 2 GPUs per node: it fails with a world_size of 4 and of 2.
            data=data_yaml_path,
            epochs=20,
            project=f'{tmp_project_location}',  # ephemeral location on the local VM
            # project=f'{volume_project_location}',  # writing directly to the volume path still does not work
            exist_ok=True,
            fliplr=1,
            flipud=1,
            perspective=0.001,
            degrees=.45
        )
        success = None
        if RANK in (0, -1):
            success = model.val()
            if success:
                model.export() # ref: https://docs.ultralytics.com/modes/export/#introduction
        

    active_run_id = mlflow.last_active_run().info.run_id
    print("For YOLO autologging, active_run_id is: ", active_run_id)

    # after training is done.
    if not dist.is_initialized():
      # import torch.distributed as dist
      dist.init_process_group("nccl")

    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    print(f"------After training, we have: RANK:{global_rank=} -- LOCAL_RANK:{local_rank=} -- world_size: {world_size=}------")

    if global_rank == 0:
        with mlflow.start_run(run_id=active_run_id) as run:
            mlflow.log_artifact(data_yaml_path, "input_data_yaml")
            # mlflow.log_dict(data, "data.yaml")
            mlflow.log_params({"rank":global_rank})
            mlflow.pytorch.log_model(YOLO(str(model.trainer.best)), "model", signature=signature) # this succeeded

    # clean up
    cleanup()

    return "finished" # can return any picklable object


train_fn.distributed(world_size = None, parent_run_id = None) # world_size and parent_run_id no longer need to be specified manually.

**Note: Testing with more than 8 GPUs is not possible yet, because the dogfood environment is currently limited to 8 A10 GPUs.**

In [0]:
settings.update({"mlflow":True}) # if you do want to autolog.
mlflow.autolog(disable = False)

print('data_yaml_path is:', data_yaml_path)

import logging
logging.getLogger("mlflow").setLevel(logging.DEBUG)


@distributed(gpus=8, gpu_type='A10', remote=True)
#: -----------worker func: this function is visible to each GPU device.-------------------
def train_fn(world_size = None, parent_run_id = None):
    try:
        from ultralytics.utils import RANK, LOCAL_RANK

        # Setup distributed training
        rank, world_size, device = setup()

        print(f"Rank: {rank}, World Size: {world_size}, Device: {device}")
        print(f"Rank: {RANK}, World Size: {world_size}, Device: {LOCAL_RANK}")


        print(f"PyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"CUDA device count: {torch.cuda.device_count()}")
            print(f"Current CUDA device: {torch.cuda.current_device()}")


        ############################
        os.environ["CUDA_LAUNCH_BLOCKING"] = "0"  # Set to "1" for synchronous kernel launches, which is easier to debug.
        os.environ["NCCL_DEBUG"] = "INFO" # "WARN" # for more debugging info on the NCCL side.
        os.environ['MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING'] = "true"
        os.environ['MLFLOW_EXPERIMENT_NAME'] = experiment_name
        # We set the experiment details here
        experiment = mlflow.set_experiment(experiment_name)
        print('data_yaml_path is:', data_yaml_path)
        
        #
        # with mlflow.start_run(run_id=parent_run_id):
        with mlflow.start_run():
            model = YOLO(f"{project_location}/raw_model/yolo11n.pt")
            model.train(
                task="detect",
                batch=16,  # Three modes: an integer (e.g., batch=16), auto mode targeting 60% GPU memory utilization (batch=-1), or auto mode with a given utilization fraction (batch=0.70).
                device=[LOCAL_RANK],  # Must be LOCAL_RANK (0 in this case) because the process group is already initialized in setup(); RANK will not work. Do not pass [0, 1] even with 2 GPUs per node: it fails with a world_size of 4 and of 2.
                data=data_yaml_path,
                epochs=100,
                project=f'{tmp_project_location}',  # ephemeral location on the local VM
                # project=f'{volume_project_location}',  # writing directly to the volume path still does not work
                exist_ok=True,
                fliplr=1,
                flipud=1,
                perspective=0.001,
                degrees=.45
            )
            success = None
            if RANK in (0, -1):
                success = model.val()
                if success:
                    model.export() # ref: https://docs.ultralytics.com/modes/export/#introduction
            

        active_run_id = mlflow.last_active_run().info.run_id
        print("For YOLO autologging, active_run_id is: ", active_run_id)

        # after training is done.
        if not dist.is_initialized():
            dist.init_process_group("nccl")

        local_rank = int(os.environ["LOCAL_RANK"])
        global_rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        print(f"------After training, we have: RANK:{global_rank=} -- LOCAL_RANK:{local_rank=} -- world_size: {world_size=}------")

        if global_rank == 0:
            with mlflow.start_run(run_id=active_run_id) as run:
                mlflow.log_artifact(data_yaml_path, "input_data_yaml")
                # mlflow.log_dict(data, "data.yaml")
                mlflow.log_params({"rank":global_rank})
                mlflow.pytorch.log_model(YOLO(str(model.trainer.best)), "model", signature=signature) # this succeeded
                #: TODO: we can log more stuff here
        
        return "finished" # can return any picklable object
    
    finally:
        # clean up
        cleanup()


train_fn.distributed(world_size = None, parent_run_id = None) # world_size and parent_run_id no longer need to be specified manually.

# Supplemental Below

## 1. Tip: Job Fails After a Long Wait During Node Launch
If the job times out because GPU resources are not ready, consider adding the settings below.

Error message: "torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 7/8 clients joined."

```python
import os

os.environ['TORCH_DISTRIBUTED_TIMEOUT'] = '7200'

# Newer PyTorch releases read the next two as TORCH_NCCL_ASYNC_ERROR_HANDLING and TORCH_NCCL_BLOCKING_WAIT.
os.environ['NCCL_ASYNC_ERROR_HANDLING'] = '1'  # Recommended for better error reporting
os.environ['NCCL_BLOCKING_WAIT'] = '1'         # Wait for full timeout
os.environ['NCCL_SOCKET_TIMEOUT'] = '600'      # Set a socket timeout in seconds
os.environ['NCCL_DEBUG'] = 'INFO'              # Enable debug logs
```

## 2. Overall Log Screening and Recommendations

# Analysis of YOLO Training Log and Optimization Recommendations

Based on the comprehensive analysis of your training log and research into distributed training best practices, here are the key areas for improvement in your cluster and training job configuration.

## **Major Issues Identified**

### **1. NCCL Network Communication Problems**

The most significant issue in your log is the **failed NCCL network initialization**[1][2][3]. The errors show:

- `NET/OFI aws-ofi-nccl initialization failed`
- `NET/OFI Unable to find a protocol that worked`
- `Using network Socket` (fallback to slower TCP networking)

This means your distributed training is **falling back to slower Socket-based communication** instead of using optimized network fabrics, significantly reducing performance[4][5].
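
To check other runs for the same fallback, the three messages listed above can be searched for in captured log text. This is a minimal sketch: the marker strings are taken from the list above, while the function name and the sample line are illustrative.

```python
# Messages from the log analysis above that indicate the Socket fallback.
NCCL_FALLBACK_MARKERS = (
    "NET/OFI aws-ofi-nccl initialization failed",
    "NET/OFI Unable to find a protocol that worked",
    "Using network Socket",
)


def find_nccl_fallback_markers(log_text):
    """Return the markers (in the order above) that occur in log_text."""
    return [marker for marker in NCCL_FALLBACK_MARKERS if marker in log_text]


# Illustrative line, not copied from a real log.
sample_log = "NCCL INFO Using network Socket"
print(find_nccl_fallback_markers(sample_log))
```

An empty list only means none of these exact strings appeared; it does not prove that an optimized transport was selected.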

### **2. Suboptimal Batch Size and Worker Configuration**

Your current setup shows **8 dataloader workers** with an unspecified batch size. Research indicates this configuration may not be optimal for your 8-GPU setup[6][7][8].

### **3. Databricks Serverless GPU Beta Limitations**

The warning `serverless_gpu is in Beta. The API is subject to change` indicates you're using experimental infrastructure that may have performance and stability limitations[9][10].

## **Cluster-Level Optimizations**

### **Network Configuration**

**Fix NCCL Network Issues:**
- Set `NCCL_SOCKET_IFNAME=eth0` explicitly in your environment variables[3][11]
- Add `NCCL_DEBUG=INFO` to get detailed networking information[12][11]
- For AWS environments, ensure EFA (Elastic Fabric Adapter) is properly configured if available[4][13]

**Recommended Environment Variables:**
```bash
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
export NCCL_NET="Socket"  # Explicit fallback if EFA unavailable
```

### **Multi-GPU Setup Optimization**

**Move from Serverless to Dedicated GPU Cluster:**
Consider migrating from Databricks Serverless GPU (Beta) to a dedicated multi-GPU cluster for production training[14][15]. **Single-node multi-GPU setups typically outperform multi-node configurations** for YOLO training due to reduced network overhead[14].

**Optimal Hardware Configuration:**
- **Single node with 8 GPUs** is likely faster than 8 nodes with 1 GPU each[14]
- Use **cluster placement groups** to minimize network latency[4]
- Ensure all nodes have **identical PyTorch, CUDA, and NCCL versions**[16]

## **Training Job Optimizations**

### **Batch Size Optimization** 

**Use Automatic Batch Size Detection (Single GPU):**
```python
# batch=-1 enables AutoBatch, which sizes the batch for about 60% GPU memory utilization.
# AutoBatch applies to single-GPU runs; for multi-GPU training, set an explicit batch size.
model.train(data="coco128.yaml", epochs=100, batch=-1, device=0)
```

This automatically picks a **batch size that fits your GPU's memory**[6][7], which is typically more efficient than manual guessing.

**Manual Batch Size Guidelines:**
- For 8 GPUs: Start with **batch=64** (8 per GPU) and scale up[6]
- **Batch sizes of 16, 32, or 64 typically yield best results**[6]
- Monitor GPU memory usage and increase until you approach memory limits[7]
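
The "batch=64 (8 per GPU)" guideline treats `batch` as a global value split evenly across GPUs. The sketch below spells out that arithmetic and rejects values that do not divide evenly, on the assumption that the batch must be a multiple of the GPU count. The function name is illustrative.

```python
def per_gpu_batch(global_batch, world_size):
    """Split a global batch evenly across world_size GPUs."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if global_batch % world_size != 0:
        suggestion = max(global_batch // world_size, 1) * world_size
        raise ValueError(
            f"batch={global_batch} is not a multiple of {world_size} GPUs; try batch={suggestion}"
        )
    return global_batch // world_size


print(per_gpu_batch(64, 8))   # the guideline above: 8 images per GPU
print(per_gpu_batch(128, 8))  # scaling up: 16 images per GPU
```

When scaling up, increase the global batch in steps of the GPU count so each device keeps an equal share.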

### **Dataloader Worker Optimization**

**Reduce Worker Count:**
Your current **8 workers may be excessive** for this setup[6][17]. Try:
```python
# Start with fewer workers to reduce RAM usage
model.train(workers=4)  # or workers=2
```

**Memory Management:**
- Disable image caching if experiencing high RAM usage: `cache=False`[17][18]
- Monitor RAM usage during training - high worker counts can exhaust system memory[17]

### **Training Parameters**

**Enable Mixed Precision Training:**
```python
model.train(amp=True)  # Automatic Mixed Precision (already the Ultralytics default)
```
This can **improve training speed and reduce memory usage** without sacrificing accuracy[7].

**Combine the Settings Above:**
```python
model.train(
    data="coco128.yaml",
    epochs=100,
    batch=64,           # Explicit global batch (8 per GPU); AutoBatch (batch=-1) is single-GPU only
    workers=4,          # Reduced worker count
    device=[0,1,2,3,4,5,6,7],  # All 8 GPUs on a dedicated single node; inside an SGC worker use device=[LOCAL_RANK]
    amp=True,           # Mixed precision
    cache=False         # Disable caching if RAM limited
)
```

## **Monitoring and Debugging**

### **Performance Monitoring**

**Add NCCL Debugging:**
Set `NCCL_DEBUG=INFO` to monitor network communication efficiency[3][11]. Look for:
- Successful network initialization messages
- Bandwidth utilization statistics
- Communication pattern optimization

**Track Key Metrics:**
- **GPU utilization** (should be >90% during training)
- **Network bandwidth utilization**
- **Memory usage** (both GPU and system RAM)
- **Training iteration time** and consistency[14]
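
For the iteration-time item, one simple way to quantify "consistency" is the coefficient of variation of per-step durations. The sketch below assumes you have already collected step times in seconds (for example, from your own timing code); the function name and the sample numbers are illustrative.

```python
from statistics import mean, pstdev


def iteration_time_summary(durations_s):
    """Summarize step durations: mean, standard deviation, and coefficient of variation."""
    if not durations_s:
        raise ValueError("durations_s must not be empty")
    avg = mean(durations_s)
    spread = pstdev(durations_s)
    return {"mean_s": avg, "stdev_s": spread, "cv": spread / avg if avg else 0.0}


# Made-up step times in seconds; one slow step stands out.
print(iteration_time_summary([0.42, 0.41, 0.43, 0.95, 0.42]))
```

A coefficient of variation that rises over a run can point to stragglers or data-loading stalls worth investigating alongside the GPU utilization metrics.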

### **Troubleshooting Steps**

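1. **Smoke-Test NCCL Initialization:**
   ```bash
   # Single-process check that the NCCL backend initializes (requires a GPU).
   # This does not measure network performance.
   MASTER_ADDR=127.0.0.1 MASTER_PORT=29500 RANK=0 WORLD_SIZE=1 \
     python -c "import torch.distributed as dist; dist.init_process_group('nccl'); print('NCCL init OK'); dist.destroy_process_group()"
   ```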

2. **Verify GPU Topology:**
   Check GPU interconnects and ensure optimal placement[19]

3. **Monitor Resource Usage:**
   Use Databricks cluster metrics to identify bottlenecks[14]

## **Long-term Recommendations**

### **Infrastructure Migration**

**Consider Moving to Production Infrastructure:**
- Migrate from **Serverless GPU (Beta)** to stable, dedicated GPU clusters[9]
- Use **instance types optimized for ML workloads** (e.g., p3, p4, g4 instances on AWS)[4]
- Implement **proper EFA networking** for multi-node scenarios[20][4]

### **Training Strategy**

**Implement Progressive Training:**
- Start with **smaller models and datasets** for parameter tuning
- Use **gradient accumulation** if memory constraints limit batch size[7]
- Consider **staged training** (train for shorter epochs, then resume) to avoid memory accumulation issues[17]

The primary bottleneck in your current setup appears to be the **failed network optimization and suboptimal batch/worker configuration**. Addressing the NCCL networking issues should provide the most significant performance improvement, followed by optimizing batch size and reducing the worker count to prevent memory exhaustion.

Sources
[1] Slow NCCL gradient synchronization in distributed training https://discuss.pytorch.org/t/slow-nccl-gradient-synchronization-in-distributed-training/89625
[2] Model Training with Ultralytics YOLO https://docs.ultralytics.com/modes/train/
[3] NCCL Ignores Specified SOCKET_IFNAME Configuration ... - GitHub https://github.com/NVIDIA/nccl/issues/1581
[4] Optimizing deep learning on P3 and P3dn with EFA - AWS https://aws.amazon.com/blogs/compute/optimizing-deep-learning-on-p3-and-p3dn-with-efa/
[5] NCCL performance for Deep Learning workloads on AWS EFA ... https://github.com/NVIDIA/nccl/issues/235
[6] What's an efficient way to fine tune the batch size? #3572 - GitHub https://github.com/ultralytics/ultralytics/issues/3572
[7] Machine Learning Best Practices and Tips for Model Training https://docs.ultralytics.com/guides/model-training-tips/
[8] I am seeing major improvements in my model and the only change ... https://community.ultralytics.com/t/i-am-seeing-major-improvements-in-my-model-and-the-only-change-has-been-the-machine-it-is-trained-on/1019
[9] Serverless GPU compute | Databricks on AWS https://docs.databricks.com/aws/en/compute/serverless/gpu
[10] Serverless GPU compute - Azure Databricks - Microsoft Learn https://learn.microsoft.com/en-us/azure/databricks/compute/serverless/gpu
[11] How to set NCCL_SOCKET_IFNAME · Issue #286 · NVIDIA/nccl https://github.com/NVIDIA/nccl/issues/286
[12] NCCL - CSCS Documentation https://docs.cscs.ch/software/communication/nccl/
[13] Optimizing deep learning on P3 and P3dn with EFA - AWS https://aws.amazon.com/blogs/compute/optimizing-deep-learning-on-p3-and-p3dn-with-efa-part-1/
[14] Best practices for deep learning on Databricks https://docs.databricks.com/aws/en/machine-learning/train-model/dl-best-practices
[15] Multi-GPU and multi-node distributed training | Databricks on AWS https://docs.databricks.com/aws/en/machine-learning/sgc-examples/gpu-distributed-training
[16] Multi node training of YOLOv8 (2 machine with 4GPU each) #7038 https://github.com/ultralytics/ultralytics/issues/7038
[17] High RAM utilization during training - PyTorch Forums https://discuss.pytorch.org/t/high-ram-utilization-during-training/159939
[18] how to avoid high RAM usage · Issue #1467 - GitHub https://github.com/ultralytics/ultralytics/issues/1467
[19] Distributed Parallel Training: PyTorch Multi-GPU Setup in Kaggle T4x2 https://learnopencv.com/distributed-parallel-training-pytorch-multi-gpu-setup/
[20] Get started with EFA and NCCL for ML workloads on Amazon EC2 https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html
[21] DDP: multi node training · Issue #6286 - GitHub https://github.com/ultralytics/ultralytics/issues/6286
[22] Multi-GPU Training with YOLOv5 - Ultralytics YOLO Docs https://docs.ultralytics.com/yolov5/tutorials/multi_gpu_training/
[23] Neuron Runtime Troubleshooting on Inf1, Inf2 and Trn1 https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-troubleshoot.html
[24] Enabling Fast Inference and Resilient Training with NCCL 2.27 https://developer.nvidia.com/blog/enabling-fast-inference-and-resilient-training-with-nccl-2-27/
[25] How to train yolov8 with multi-gpu? · Issue #3308 - GitHub https://github.com/ultralytics/ultralytics/issues/3308
[26] [bug] NCCL WARN NET/OFI Only EFA provider is supported #2675 https://github.com/aws/deep-learning-containers/issues/2675
[27] Issues when trying to train on a multi-GPU device #5244 - GitHub https://github.com/ultralytics/ultralytics/issues/5244
[28] Version · Issue #391 · aws/aws-ofi-nccl - GitHub https://github.com/aws/aws-ofi-nccl/issues/391
[29] Distributed Training: Definition & How it Works - Ultralytics https://www.ultralytics.com/glossary/distributed-training
[30] Using EFA on the DLAMI - AWS Deep Learning AMIs https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-efa-using.html
[31] On the Performance and Memory Footprint of Distributed Training https://arxiv.org/html/2407.02081v1
[32] YOLO v11 training multi-GPU DDP Errors - Stack Overflow https://stackoverflow.com/questions/79372969/yolo-v11-training-multi-gpu-ddp-errors
[33] Distributed Parallel Training Example (GPU) https://www.mindspore.cn/tutorials/experts/en/r2.0.0-alpha/parallel/train_gpu.html
[34] Configure YOLOv8 for GPU: Accelerate Object Detection https://www.digitalocean.com/community/tutorials/yolov8-for-gpu-accelerate-object-detection
[35] Simplifying Training and GenAI Finetuning Using Serverless GPU ... https://www.youtube.com/watch?v=pQMeeQ_jGY0
[36] aws-samples/eks-efa-examples - GitHub https://github.com/aws-samples/eks-efa-examples
[37] Multi-GPU and multi-node distributed training - Azure Databricks https://learn.microsoft.com/en-us/azure/databricks/machine-learning/sgc-examples/gpu-distributed-training
[38] Configuration - Ultralytics YOLO Docs https://docs.ultralytics.com/usage/cfg/
[39] Best practices for performance efficiency | Databricks on AWS https://docs.databricks.com/aws/en/lakehouse-architecture/performance-efficiency/best-practices
[40] YOLOv5 Study: mAP vs Batch-Size #2452 - GitHub https://github.com/ultralytics/yolov5/discussions/2452
[41] High-Performance GPU Memory Transfer on AWS Sagemaker ... https://www.perplexity.ai/hub/blog/high-performance-gpu-memory-transfer-on-aws
[42] ML Training Tip Of The Week #1: Optimizing GPU ... https://community.databricks.com/t5/technical-blog/ml-training-tip-of-the-week-1-optimizing-gpu-utilization-in/ba-p/86677
[43] Tips for Best YOLOv5 Training Results - Ultralytics YOLO Docs https://docs.ultralytics.com/yolov5/tutorials/tips_for_best_training_results/
[44] NCCL error when using Sagemaker distributed training without ... https://stackoverflow.com/questions/75064559/nccl-error-when-using-sagemaker-distributed-training-without-specifying-a-distri
[45] Normal then slow then crashing training - YOLO - Ultralytics https://community.ultralytics.com/t/normal-then-slow-then-crashing-training/1203
[46] The usage of video memory fluctuates greatly during YOLO11 training https://github.com/ultralytics/ultralytics/issues/20860
[47] Serverless compute plane networking - Azure Databricks https://learn.microsoft.com/en-us/azure/databricks/security/network/serverless-network-security/
[48] Unable to see NCCL logs - PyTorch Forums https://discuss.pytorch.org/t/unable-to-see-nccl-logs/176114
[49] Optimize GPU utilization while training - YOLO - Ultralytics https://community.ultralytics.com/t/optimize-gpu-utilization-while-training/768
[50] EFA cheatsheet - aws-samples/awsome-distributed-training https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/efa-cheatsheet.md


**NCCL log issues and recommendations**: the issues observed in the NCCL logs, followed by a matching recommendation for each.

---

## 🚩 Issues Observed in Your Logs

1. **Transport Layer**
   - NCCL is falling back to **`NET/Socket`** transport.  
   - This is functional but **sub‑optimal** for multi‑node training if you have InfiniBand, RoCE, or AWS EFA available.  
   - GPUDirect RDMA (`GDR`) is disabled (`GDR 0`), so GPU memory copies are going through host memory.

2. **CollNet / NVLink**
   - Logs show `2 collnet channels` but also earlier warnings about missing `ncclCollNetPlugin_v10`.  
   - CollNet is provisioned but not actually active.  
   - `MNNVL 0` and `0 nvls channels` confirm no NVLink multi‑node or NVLink‑SHARP acceleration.

3. **Tuner Plugin**
   - NCCL tried to load `libnccl-tuner.so` and failed, falling back to the **internal tuner**.  
   - This is safe, but you lose the ability to auto‑tune thresholds for your specific network.

4. **P2P Support**
   - `intraNodeP2pSupport 0 directMode 0` → no GPU‑to‑GPU direct P2P.  
   - Expected if you only have one GPU per node, but if you *do* have multiple GPUs per node, this means P2P isn’t configured correctly.

5. **Socket Parallelism**
   - Using `2 threads` × `8 sockets per thread`.  
   - This is decent, but may not saturate high‑bandwidth links if you’re on a 100 Gbps+ fabric.
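
The checks above can be automated with a small log scanner. The following is a sketch only: it searches for the exact markers quoted in this list (`NET/Socket`, `GDR 0`, `MNNVL 0`, `0 nvls channels`, `intraNodeP2pSupport 0`). The function name `scan_nccl_log` is hypothetical, and the sample text is a stand-in assembled from those markers, not a real NCCL log line format.

```python
import re

# Markers quoted in the findings above, mapped to a short description.
INDICATORS = {
    r"NET/Socket": "socket transport in use",
    r"\bGDR 0\b": "GPUDirect RDMA disabled",
    r"\bMNNVL 0\b": "no multi-node NVLink",
    r"\b0 nvls channels\b": "no NVLink SHARP channels",
    r"\bintraNodeP2pSupport 0\b": "no intra-node GPU P2P",
}


def scan_nccl_log(text: str) -> list[str]:
    """Return the description of every indicator found in the log text."""
    return [meaning for pattern, meaning in INDICATORS.items() if re.search(pattern, text)]


# Hypothetical excerpt built only from the markers above; real log lines carry more fields.
sample = """
NET/Socket ... GDR 0
MNNVL 0
0 nvls channels
intraNodeP2pSupport 0 directMode 0
"""

for finding in scan_nccl_log(sample):
    print("-", finding)
```

Running this over the captured output of a job started with `NCCL_DEBUG=INFO` gives a quick before/after comparison when trying the recommendations that follow.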

---

## 💡 Recommendations

### 1. Optimize Transport
- If you have **InfiniBand or RoCE**:
  ```bash
  export NCCL_NET=IB
  ```
- If you’re on **AWS with EFA** (this relies on the `aws-ofi-nccl` plugin being installed):
  ```bash
  export NCCL_NET=OFI
  export FI_PROVIDER=efa
  ```
- If you only have Ethernet, sockets are fine, but you can tune them (see below).
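
In a notebook, these variables are usually set from Python rather than a shell. The sketch below wraps the same choices in a lookup table. The values are copied from the bullets above and have not been verified against a particular NCCL or plugin version. The names `TRANSPORT_ENV` and `apply_transport_env` are hypothetical.

```python
import os
from collections.abc import MutableMapping

# Assumption: values mirror the shell exports listed above.
TRANSPORT_ENV = {
    "infiniband": {"NCCL_NET": "IB"},
    "efa": {"NCCL_NET": "OFI", "FI_PROVIDER": "efa"},
    "ethernet": {},  # keep NCCL defaults (socket transport)
}


def apply_transport_env(fabric: str, environ: MutableMapping = os.environ) -> dict:
    """Write the NCCL settings for the given fabric into `environ` and return them."""
    if fabric not in TRANSPORT_ENV:
        raise ValueError(f"unknown fabric {fabric!r}; expected one of {sorted(TRANSPORT_ENV)}")
    settings = TRANSPORT_ENV[fabric]
    environ.update(settings)
    return dict(settings)


# Demonstrate on a scratch dict so the real process environment is left untouched.
scratch = {}
print(apply_transport_env("efa", scratch))
```

NCCL reads its environment when the communicator is created, so a call like this needs to run inside each training process before the process group is initialized.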

### 2. Enable GPUDirect RDMA (if hardware supports it)
- Install Mellanox OFED drivers and ensure `nvidia-peermem` is loaded.  
- Then NCCL should log `GDR 1` instead of `GDR 0`.

### 3. CollNet / Hierarchical Collectives
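- If you want CollNet:
  - CollNet support comes from a network plugin rather than from an NCCL build flag; the missing `ncclCollNetPlugin_v10` symbol in the logs means no such plugin was found.  
  - Install a plugin that implements CollNet (for example NVIDIA's SHARP plugin), then set `NCCL_COLLNET_ENABLE=1`.  
- If you don’t need it, disable it explicitly:
  ```bash
  export NCCL_COLLNET_ENABLE=0
  ```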

### 4. Tuner Plugin
- Optional: build or install `libnccl-tuner.so` if you want NCCL to auto‑tune thresholds for your exact network.  
- Otherwise, the internal tuner is fine.

### 5. Socket Backend Tuning
- Increase parallelism if you’re bandwidth‑limited:
  ```bash
  export NCCL_SOCKET_NTHREADS=4
  export NCCL_NSOCKS_PERTHREAD=8
  ```
- Adjust based on CPU/network load.
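
When experimenting with these two values, note that NCCL documents an upper bound: the product of `NCCL_SOCKET_NTHREADS` and `NCCL_NSOCKS_PERTHREAD` must not exceed 64. A minimal sketch of a guard for that rule follows; the helper name `socket_tuning_env` is hypothetical.

```python
def socket_tuning_env(nthreads: int, nsocks_per_thread: int) -> dict[str, str]:
    """Build the two socket-tuning variables, rejecting combinations NCCL does not allow."""
    if nthreads < 1 or nsocks_per_thread < 1:
        raise ValueError("both values must be at least 1")
    if nthreads * nsocks_per_thread > 64:
        raise ValueError("NCCL_SOCKET_NTHREADS * NCCL_NSOCKS_PERTHREAD must not exceed 64")
    return {
        "NCCL_SOCKET_NTHREADS": str(nthreads),
        "NCCL_NSOCKS_PERTHREAD": str(nsocks_per_thread),
    }


# The setting suggested above: 4 threads x 8 sockets = 32 sockets per connection.
print(socket_tuning_env(4, 8))
```

The returned dictionary can be merged into `os.environ` in each worker before the process group is created.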

### 6. P2P (if multi‑GPU per node)
- Ensure GPUs are on the same PCIe root complex.  
- Check with:
  ```bash
  nvidia-smi topo -m
  ```
- If P2P is supported, NCCL should show `intraNodeP2pSupport 1`.

---

## ✅ Summary
Right now, your setup is **working but not optimized**:  
- You’re on **Socket transport** with no GPUDirect, no CollNet, and no tuner plugin.  
- That’s fine for functional correctness, but you’re leaving performance on the table if you have faster interconnects.  

---


In [0]:
import os

import mlflow
import torch
import torch.distributed as dist
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import ColSpec, Schema
from pyspark.ml.torch.distributor import TorchDistributor  # only for the classic GPU path commented out below
from serverless_gpu import distributed  # provides the @distributed decorator used below
from ultralytics import YOLO, settings
from ultralytics.utils import LOCAL_RANK, RANK

I ran 4-6 tests with different CUDA device settings for the minimal example below to show that it won't work without the right setup.
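
The device settings under test come down to the difference between a process's global rank and its local rank. The sketch below assumes the usual contiguous rank layout (ranks assigned node by node) and one GPU per node for the 8-GPU A10 run; both are assumptions rather than facts confirmed by the logs. The helper name `local_rank` is hypothetical.

```python
def local_rank(global_rank: int, gpus_per_node: int) -> int:
    """Index of the GPU a process should use on its own node (assumes contiguous ranks)."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return global_rank % gpus_per_node


# Assumed layout: 8 processes, one GPU per node.
for rank in range(8):
    print(f"RANK={rank} -> LOCAL_RANK={local_rank(rank, gpus_per_node=1)}")
```

With one GPU per node every process sees only device 0, so passing the global rank as the device index would point at a GPU that does not exist on that node for every rank except 0.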

In [0]:
# data_yaml_path = "coco128.yaml"  # ref: https://github.com/ultralytics/ultralytics/blob/main/ultralytics/cfg/datasets/coco128.yaml

@distributed(gpus=8, gpu_type='A10', remote=True)
def train_fn():
  # Start a run to represent the training job
  with mlflow.start_run():
    model = YOLO("yolo11n")
    model.train(
        task="detect",
        # batch has three modes: an integer (e.g., batch=16), auto mode for 60% GPU memory
        # utilization (batch=-1), or auto mode with a specified utilization fraction (batch=0.70).
        batch=16,
        # device needs to be [LOCAL_RANK], i.e., 0 in this case, since init_process_group has
        # already run beforehand. RANK won't work. There is no need to list every GPU on the node:
        # with 2 GPUs per node, [0, 1] fails whether world_size was set to 4 or 2 beforehand.
        device=[LOCAL_RANK],
        data=data_yaml_path,
        epochs=100,
        project=f'{tmp_project_location}',  # local VM ephemeral location
        # project=f'{volume_project_location}',  # the volume path still won't work
        exist_ok=True,
        fliplr=1,
        flipud=1,
        perspective=0.001,
        degrees=.45
    )

train_fn.distributed()

## Conclusion
## After a few iterations (screenshots of the error messages from the experiment log are stored locally), we conclude that @distributed won't work with this simple setup.


# ---- The code below is commented out because it was written for classic GPU compute. ----
# distributor = TorchDistributor(num_processes=1, local_mode=True, use_gpu=True)
# distributor.run(train_fn)
# # On SGC this fails with: [CONFIG_NOT_AVAILABLE] Configuration spark.master is not available. SQLSTATE: 42K0I