A comprehensive benchmarking CLI for PyTorch Distributed Data Parallel (DDP) clusters. TorchScale automates multi-GPU benchmarks, profiling, and performance analysis to help ML Infrastructure Engineers optimize training throughput on GPU clusters.
- **Automated Benchmarking**: Run throughput and latency tests across multiple GPU configurations
- **Profiling Integration**: Deep dive into performance with NVIDIA Nsight Systems integration
- **Visual Reports**: Generate HTML/PDF reports with scaling efficiency curves and bottleneck analysis
- **System Validation**: Check DDP requirements (PyTorch, CUDA, NCCL, drivers)
- **YAML Configuration**: Define reproducible experiment configurations
| Component | Technology | Role |
|---|---|---|
| Core Language | Python 3.9+ | The glue code and main logic |
| Deep Learning | PyTorch (DDP) | Distributed training backend |
| CLI Framework | Typer | Robust command-line interface |
| Profiling | NVIDIA Nsight Systems | Capture CUDA kernel and NCCL traces |
| Orchestration | torchrun | Spawn distributed processes |
| Data Analysis | Pandas | Parse logs and calculate statistics |
| Visualization | Plotly/Matplotlib | Generate performance visualizations |
| Configuration | YAML | Define experiment parameters |
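As the table notes, the command-line layer is built on Typer, with the real entry point in `torchscale/cli/main.py` (see the project structure below). The following is only a minimal, hypothetical sketch of how a Typer app exposes subcommands, not TorchScale's actual source:

```python
# Hypothetical sketch, not TorchScale's actual main.py: how a Typer app
# typically wires subcommands like `validate` onto a single entry point.
import typer

app = typer.Typer(help="Benchmarking CLI for PyTorch DDP Clusters.")


@app.command()
def validate(
    verbose: bool = typer.Option(False, "--verbose", "-v", help="Enable verbose logging."),
) -> None:
    """Check if the current node meets DDP requirements."""
    typer.echo("Running DDP validation checks...")


if __name__ == "__main__":
    app()
```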
Install from source:

```bash
# Clone the repository
git clone https://github.com/ambicuity/Distributed-Deep-Learning-Benchmark-Suite.git
cd Distributed-Deep-Learning-Benchmark-Suite

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
```

Check if your system meets DDP requirements:
```bash
torchscale validate
```

Create a `benchmark_config.yaml` file (or use the provided example):
experiment_name: "resnet50_scaling_test"
models:
- "resnet50"
- "bert-large"
batch_sizes:
- 64
- 128
- 256
gpu_counts:
- 1
- 2
- 4
- 8
profiling:
enabled: true
tool: "nsys"
trigger: "sync_stall_detection"Execute throughput and latency tests:
Execute throughput and latency tests:

```bash
torchscale benchmark --config benchmark_config.yaml --output-dir ./results
```

Example Output:
```text
> Running ResNet50 on 4 GPUs (Batch 128)...
> [INFO] Average Throughput: 1,450 img/sec
> [INFO] Iteration Latency: 220ms
> [WARN] High variance in gradient sync time detected.
```
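The throughput and latency figures above come from timed iterations. A minimal sketch of how such a measurement is typically taken (illustrative, forward pass only, not TorchScale's internal loop):

```python
# Illustrative sketch only: time iterations around explicit CUDA syncs.
import time

import torch


def measure(model: torch.nn.Module, batch: torch.Tensor, iters: int = 50, warmup: int = 10):
    """Return (samples/sec, mean iteration latency in ms)."""
    model, batch = model.cuda().eval(), batch.cuda()
    with torch.no_grad():
        for _ in range(warmup):       # exclude CUDA init and autotuning
            model(batch)
        torch.cuda.synchronize()      # finish warm-up kernels before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        torch.cuda.synchronize()      # wait for all timed kernels to complete
    elapsed = time.perf_counter() - start
    return batch.shape[0] * iters / elapsed, elapsed / iters * 1000
```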
Dive deeper into performance issues:

```bash
torchscale profile --gpus 4 --duration 30s --target sync_stalls
```

Example Output:
```text
> [ANALYSIS] Profiling complete.
> [ANALYSIS] Sync stall time: 15.0%
> [ANALYSIS] Found 2 bottleneck(s):
>   1. NCCL all_reduce: GPU-0 finishing backward pass slower than GPU-3
>   2. Gradient synchronization: High variance in gradient sync time
```
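Under the hood, the `profile` command wraps the distributed launch in Nsight Systems. The exact invocation is not documented here; a plausible sketch using standard `nsys` and `torchrun` options, with `train.py` as a hypothetical training script:

```python
# Plausible sketch of wrapping a torchrun launch in Nsight Systems;
# `train.py` is a hypothetical script, not part of this repository.
import subprocess


def profile_run(gpus: int, duration_s: int, out: str = "profile_report") -> None:
    subprocess.run(
        [
            "nsys", "profile",
            "--trace=cuda,nvtx",          # capture CUDA kernels and NVTX ranges
            f"--duration={duration_s}",   # stop collection after N seconds
            "-o", out,                    # writes <out>.nsys-rep
            "torchrun", f"--nproc_per_node={gpus}", "train.py",
        ],
        check=True,
    )
```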
Create a visual performance report:
```bash
torchscale report generate --source ./results --format html
```

This generates an HTML report with:
- Scaling efficiency plots (Ideal vs. Actual; the calculation is sketched below)
- Throughput and latency statistics
- Bottleneck analysis with recommendations
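Scaling efficiency is the standard ratio of measured throughput to the linear-scaling baseline, efficiency(n) = T(n) / (n · T(1)). A sketch of the calculation with pandas, using illustrative numbers rather than measured results:

```python
# Sketch of the scaling-efficiency calculation behind the plots;
# the throughput values are illustrative, not real benchmark data.
import pandas as pd

df = pd.DataFrame({"gpus": [1, 2, 4, 8], "throughput": [400.0, 780.0, 1450.0, 2600.0]})
t1 = df.loc[df["gpus"] == 1, "throughput"].item()
df["ideal"] = df["gpus"] * t1                       # linear-scaling baseline
df["efficiency"] = df["throughput"] / df["ideal"]   # 1.0 == perfect scaling
print(df)
```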
```text
Usage: torchscale [OPTIONS] COMMAND [ARGS]...

  Benchmarking CLI for PyTorch DDP Clusters.

Options:
  --verbose / --quiet  Enable verbose logging.
  --help               Show this message and exit.

Commands:
  benchmark  Run throughput/latency tests across GPU configs.
  profile    Wrap execution in Nsight Systems to find bottlenecks.
  report     Generate visual HTML/PDF performance reports.
  validate   Check if current node meets DDP requirements (NCCL, Drivers).
```
Run throughput/latency tests:

```bash
torchscale benchmark --config <config.yaml> --output-dir <output_dir> [--verbose]
```

Options:
- `--config, -c`: Path to the benchmark configuration YAML file (required)
- `--output-dir, -o`: Directory to save results (default: `./results`)
- `--verbose, -v`: Enable verbose logging
Profile execution with Nsight Systems:

```bash
torchscale profile --gpus <N> --duration <seconds> --target <target> [--output-dir <dir>]
```

Options:
- `--gpus, -g`: Number of GPUs to profile (default: 4)
- `--duration, -d`: Duration in seconds (default: 30)
- `--target, -t`: Profiling target (default: `sync_stalls`)
- `--output-dir, -o`: Directory to save results (default: `./results`)
- `--verbose, -v`: Enable verbose logging
Generate performance reports:

```bash
torchscale report --source <results_dir> --format <html|pdf> [--output <file>]
```

Options:
- `--source, -s`: Directory containing benchmark results (default: `./results`)
- `--format, -f`: Output format, `html` or `pdf` (default: `html`)
- `--output, -o`: Output file path (default: auto-generated)
- `--verbose, -v`: Enable verbose logging
Validate system requirements:

```bash
torchscale validate [--verbose]
```

Options:
- `--verbose, -v`: Enable verbose logging
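The checks behind `validate` are not listed in this README; the following sketch shows the kind of probes such a command can run. The PyTorch APIs used here are real, but the output format is illustrative:

```python
# Sketch of DDP prerequisite checks; output format is illustrative.
import torch
import torch.distributed as dist


def validate_node() -> bool:
    ok = torch.cuda.is_available()
    if ok:
        print(f"[ OK ] {torch.cuda.device_count()} GPU(s), CUDA {torch.version.cuda}")
    else:
        print("[FAIL] No CUDA-capable GPU or driver visible to PyTorch")
    if dist.is_nccl_available():
        print("[ OK ] PyTorch was built with NCCL support")
    else:
        print("[FAIL] PyTorch was built without NCCL support")
        ok = False
    return ok


if __name__ == "__main__":
    raise SystemExit(0 if validate_node() else 1)
```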
Scenario: you notice a 15% performance degradation when scaling from 4 to 8 GPUs.
Phase 1: Configuration

```bash
# Create a config with a GPU scaling matrix
cat > scaling_test.yaml <<EOF
experiment_name: "scaling_analysis"
models: ["resnet50"]
batch_sizes: [128]
gpu_counts: [1, 2, 4, 8]
profiling:
  enabled: true
EOF
```

Phase 2: Execution
```bash
torchscale benchmark --config scaling_test.yaml --output-dir ./scaling_results --verbose
```

Phase 3: Profiling
```bash
torchscale profile --gpus 8 --duration 60s --target sync_stalls --output-dir ./scaling_results
```

Phase 4: Analysis
```bash
torchscale report generate --source ./scaling_results --format html
```

Result: The report identifies that NCCL all_reduce operations wait for 15% of cycle time because GPU-0 finishes its backward pass more slowly than GPU-7 (system noise or thermal throttling).
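The straggler effect shows up because `all_reduce` is a collective: every rank waits for the slowest one. A sketch of how per-rank sync time can be measured with CUDA events, assuming an initialized NCCL process group (e.g. launched via `torchrun`); this is not TorchScale's internal code:

```python
# Sketch: time the gradient all_reduce on each rank with CUDA events to
# surface the per-rank variance the report flags.
import torch
import torch.distributed as dist


def timed_allreduce_ms(tensor: torch.Tensor) -> float:
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    dist.all_reduce(tensor)          # straggler ranks make every rank wait here
    end.record()
    torch.cuda.synchronize()         # ensure both events have been recorded
    return start.elapsed_time(end)   # elapsed milliseconds on this rank
```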
experiment_name: "my_experiment" # Experiment identifier
models: # List of models to benchmark
- "resnet50"
- "bert-large"
- "gpt2"
batch_sizes: # Batch sizes per GPU
- 32
- 64
- 128
gpu_counts: # Number of GPUs to test
- 1
- 2
- 4
- 8
profiling: # Profiling configuration
enabled: true # Enable/disable profiling
tool: "nsys" # Profiling tool (nsys)
trigger: "sync_stall_detection" # What to profile- Python 3.9+
- PyTorch 2.0+ with CUDA support
- CUDA-capable GPU(s)
- NCCL library
- NVIDIA Driver
- NVIDIA Nsight Systems (`nsys`) for profiling
```text
Distributed-Deep-Learning-Benchmark-Suite/
├── torchscale/
│   ├── __init__.py
│   ├── cli/
│   │   ├── __init__.py
│   │   └── main.py            # Main CLI application
│   ├── core/
│   │   ├── __init__.py
│   │   ├── config.py          # Configuration management
│   │   └── benchmark.py       # Benchmark execution
│   ├── profiling/
│   │   ├── __init__.py
│   │   └── nsight.py          # Nsight Systems integration
│   ├── reporting/
│   │   ├── __init__.py
│   │   └── generator.py       # Report generation
│   └── utils/
│       ├── __init__.py
│       └── validation.py      # System validation
├── benchmark_config.yaml      # Example configuration
├── requirements.txt           # Python dependencies
├── setup.py                   # Package setup
└── README.md                  # This file
```
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.
For issues and questions, please open an issue on the GitHub repository.