Profiling tools for performance studies of competing ML frameworks on HPC systems
TODO
- Add check to determine if running on Intel GPUs; if so, load intel_extension_for_{pytorch,deepspeed}
- Modify implementation to add support for Intel GPUs, test on ALCF systems
- Add support for additional (transformer-based) model architectures in src/mlprof/network/pytorch/*
  - ideally, support for pulling in arbitrary models from HuggingFace, torchvision, etc.
- Repeat the MPI profile experiments with larger batch size / network size using module load conda/2023-01-10-unstable on Polaris
  - Try with single + multiple nodes to measure the performance impact
- Write DeepSpeed Trainer that wraps src/mlprof/network/pytorch/network.py
  - Reference: DeepSpeed -- Getting Started
- MPI profiling to get all collective comm. ops with the same model in DeepSpeed, DDP, and Horovod
  - Reference: Profiling using libmpitrace.so on Polaris
- Start with 2 nodes first, then scale with an increasing number of nodes
- Get profiles for DeepSpeed ZeRO stages 1, 2, and 3 and Mixture of Experts (MoE)
- Identify which parameters can impact performance, such as NCCL environment variables and framework-specific parameters (see the sketch after this list)
- Do the analysis for both standard models and large language models (LLMs)
- Develop auto-tuning methods to set these parameters for optimal performance
- Associate mpiprofiles with their backend + attach logs to keep everything together
- Scale up message sizes in mpiprofiles
- Aggregate into a table, grouped by backend
- Test fp16 support w/ all backends
- Ensure all GPUs are being utilized
  - w/ conda/2022-09-08-hvd-nccl, all processes get mapped to GPU0 for some reason
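For the parameter-identification item above, the sketch below shows the kind of NCCL environment variables worth sweeping before launching a run; the specific values are illustrative examples, not tuned recommendations:
# example NCCL knobs to sweep (values are illustrative, not recommendations)
export NCCL_DEBUG=INFO          # log NCCL version and transport selection
export NCCL_NET_GDR_LEVEL=PHB   # control when GPUDirect RDMA is used
export NCCL_CROSS_NIC=1         # allow communication rings to span multiple NICs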
Note
These instructions assume that your active environment already has the required ML libraries installed. This lets us perform an isolated, editable installation inside the existing environment while still having access to the previously installed libraries.
To install:
# for ALCF systems, first:
module load conda ; conda activate base
# otherwise, start here:
python3 -m venv venv --system-site-packages
source venv/bin/activate
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install -e .
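Optionally, a quick sanity check confirms that both the editable package (assuming it is importable as mlprof) and the site-wide ML libraries are visible from inside the virtual environment:
# optional: verify the editable install and the inherited site packages
python3 -c 'import mlprof; print(mlprof.__file__)'
python3 -c 'import torch; print(torch.__version__, torch.cuda.is_available())'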
We support distributed training using the following backends:
- microsoft/DeepSpeed (backend=deepspeed)
- horovod/horovod (backend=horovod)
- pytorch DDP (backend=DDP)
which we specify via backend=BACKEND as an argument to the src/mlprof/train.sh script:
cd src/mlprof
./train.sh backend=BACKEND > train.log 2>&1 &
and view the resulting output:
tail -f train.log $(tail -1 logs/latest)
Configuration options are specified in src/mlprof/conf/config.yaml and can be overridden on the command line, e.g.
./train.sh backend=DDP data.batch_size=256 network.hidden_size=64 > train.log 2>&1 &
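Because backend is just another override, the same script can be looped to produce one log per backend with otherwise identical settings, which makes later comparisons easier:
# run each backend sequentially with the same configuration, one log file per backend
for b in deepspeed horovod DDP; do
    ./train.sh backend=$b data.batch_size=256 > "train-${b}.log" 2>&1
done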
Run on Polaris:
qsub \
-A <project-name> \
-q debug-scaling \
-l select=2 \
-l walltime=12:00:00,filesystem=eagle:home:grand \
-I
module load conda/2023-01-10-unstable
conda activate base
git clone https://www.github.com/argonne-lcf/mlprof
cd mlprof
mkdir -p venvs/polaris/2023-01-10
python3 -m venv venvs/polaris/2023-01-10 --system-site-packages
source venvs/polaris/2023-01-10/bin/activate
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install -e .
cd src/mlprof
# TO TRAIN:
./train.sh backend=deepspeed > train.log 2>&1 &
# TO VIEW OUTPUT:
tail -f train.log $(tail -1 logs/latest)
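While a training run is active, it is worth confirming that all four GPUs on each node are actually busy (see the TODO item above about all processes landing on GPU0); a quick check from the node is:
# check per-GPU utilization and memory while training is running
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv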
Warning
Running with DeepSpeed: If you're using DeepSpeed directly to launch the multi-node training, you will need to ensure the following environment variables are defined in your .deepspeed_env file. The contents of this file should be one environment variable per line, formatted as KEY=VALUE. Each of these environment variables will be explicitly set on every worker node by DeepSpeed.
# -------------------------------------------------------------
# the following are necessary when using the DeepSpeed backend
export CFLAGS="-I${CONDA_PREFIX}/include/"
export LDFLAGS="-L${CONDA_PREFIX}/lib/"
echo "PATH=${PATH}" > .deepspeed_env
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> .deepspeed_env
echo "https_proxy=${https_proxy}" >> .deepspeed_env
echo "http_proxy=${http_proxy}" >> .deepspeed_env
echo "CFLAGS=${CFLAGS}" >> .deepspeed_env
echo "LDFLAGS=${LDFLAGS}" >> .deepspeed_env
# -------------------------------------------------------------
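For reference, below is a minimal sketch of driving the multi-node run through the deepspeed launcher itself (the launcher is what consumes .deepspeed_env on each worker). The hostfile construction and the src/mlprof/train.py entry point are assumptions for illustration; Polaris nodes have 4 GPUs each, hence slots=4:
# build a DeepSpeed hostfile from the PBS node list (one "host slots=4" line per node)
sort -u "${PBS_NODEFILE}" | sed 's/$/ slots=4/' > hostfile
# launch through the deepspeed launcher; the entry-point path is an assumption
deepspeed --hostfile=hostfile src/mlprof/train.py backend=deepspeed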
To run an experiment with mpitrace enabled on Polaris, we can explicitly set the LD_PRELOAD environment variable, e.g.
LD_PRELOAD=/soft/perftools/mpitrace/lib/libmpitrace.so ./train.sh > train.log 2>&1 &
which will write MPI profiling information to an mpi_profile.XXXXXX.Y file containing the following information:
MPI Profile Results
Data for MPI rank 0 of 8:
Times from MPI_Init() to MPI_Finalize().
-----------------------------------------------------------------------
MPI Routine #calls avg. bytes time(sec)
-----------------------------------------------------------------------
MPI_Comm_rank 3 0.0 0.000
MPI_Comm_size 1 0.0 0.000
MPI_Bcast 2 16.5 0.000
-----------------------------------------------------------------------
total communication time = 0.000 seconds.
total elapsed time = 232.130 seconds.
user cpu time = 122.013 seconds.
system time = 96.950 seconds.
max resident set size = 4064.422 MiB.
-----------------------------------------------------------------
Message size distributions:
MPI_Bcast #calls avg. bytes time(sec)
1 4.0 0.000
1 29.0 0.000
-----------------------------------------------------------------
Summary for all tasks:
Rank 0 reported the largest memory utilization : 4064.42 MiB
Rank 0 reported the largest elapsed time : 232.13 sec
minimum communication time = 0.000 sec for task 6
median communication time = 0.000 sec for task 5
maximum communication time = 0.000 sec for task 4
MPI timing summary for all ranks:
taskid host cpu comm(s) elapsed(s) user(s) system(s) size(MiB) switches
0 x3210c0s37b1n0 0 0.00 232.13 122.01 96.95 4064.42 240460957
1 x3210c0s37b1n0 1 0.00 227.60 126.06 95.88 4001.15 231353798
2 x3210c0s37b1n0 2 0.00 227.63 135.59 85.93 3965.89 230507191
3 x3210c0s37b1n0 3 0.00 227.63 126.33 95.75 4003.07 230342296
4 x3210c0s7b0n0 0 0.00 227.66 137.07 83.80 4039.70 209534784
5 x3210c0s7b0n0 1 0.00 227.64 125.65 96.13 4004.05 230622703
6 x3210c0s7b0n0 2 0.00 227.64 134.53 87.16 3968.59 229010244
7 x3210c0s7b0n0 3 0.00 227.67 125.24 96.90 4004.26 233186459
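When repeating mpitrace-enabled runs for each backend, it helps to keep the resulting mpi_profile.* files separated by backend and to skim the headline numbers; a small sketch (the directory layout is just a suggestion):
# stash the profiles from the last run under the backend name
mkdir -p profiles/deepspeed && mv mpi_profile.* profiles/deepspeed/
# quick comparison of total communication time across saved profiles
grep -H "total communication time" profiles/*/mpi_profile.*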