
mlprof

Profiling tools for performance studies of competing ML frameworks on HPC systems

TODO

06/05/2023

  • Add a check to determine whether we are running on Intel GPUs; if so, load intel_extension_for_{pytorch,deepspeed} (see the sketch after this list)
    • Modify the implementation to add support for Intel GPUs and test on ALCF systems
  • Add support for additional (transformer based) model architectures in src/mlprof/network/pytorch/*
    • ideally, support for pulling in arbitrary models from HuggingFace, torchvision, etc.
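
A minimal sketch of what such a check might look like (hypothetical, not yet implemented in mlprof; it assumes intel_extension_for_pytorch exposes Intel GPUs to PyTorch through the xpu device type):

# Hypothetical sketch, not part of mlprof: detect Intel GPUs and fall back
# to CUDA / CPU otherwise.
import torch


def running_on_intel_gpu() -> bool:
    """Return True if an Intel (XPU) device is visible to PyTorch."""
    try:
        # Importing IPEX registers the 'xpu' device type with torch
        import intel_extension_for_pytorch  # noqa: F401
    except ImportError:
        return False
    return hasattr(torch, "xpu") and torch.xpu.is_available()


DEVICE = (
    "xpu" if running_on_intel_gpu()
    else ("cuda" if torch.cuda.is_available() else "cpu")
)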

04/17/2023

  • Repeat the MPI profiling experiments with larger batch size / network size using module load conda/2023-01-10-unstable on Polaris
  • Try with single and multiple nodes to measure the performance impact

Older

  • Write DeepSpeed Trainer that wraps src/mlprof/network/pytorch/network.py
  • MPI profiling to capture all collective communication ops with the same model in DeepSpeed, DDP, and Horovod
    • Reference: Profiling using libmpitrace.so on Polaris
  • Start with 2 nodes, then scale up with an increasing number of nodes
  • Get profiles for DeepSpeed ZeRO 1, 2, 3 and Mixture of Experts (MoE)
  • Identify which settings impact performance, e.g. NCCL environment variables and framework-specific parameters (see the sketch after this list)
  • Perform the analysis for both standard models and large language models (LLMs)
  • Develop auto-tuning methods to set these parameters for optimal performance
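
As an illustration of the kind of settings such a study might sweep, the sketch below sets a few standard NCCL environment variables. The variable names are real NCCL settings, but the values are illustrative rather than tuned recommendations, and the helper is hypothetical (not part of mlprof):

# Hypothetical sketch, not part of mlprof: example NCCL environment variables
# that can affect collective-communication performance.
import os

NCCL_SETTINGS = {
    "NCCL_DEBUG": "INFO",          # log NCCL's algorithm / transport choices
    "NCCL_NET_GDR_LEVEL": "PHB",   # how aggressively to use GPUDirect RDMA
    "NCCL_SOCKET_IFNAME": "hsn",   # illustrative: restrict NCCL to one interface
}


def apply_nccl_settings(settings):
    """Set each variable only if it is not already defined by the user."""
    for key, value in settings.items():
        os.environ.setdefault(key, value)


apply_nccl_settings(NCCL_SETTINGS)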

2023-02-20

  • Associate mpiprofiles with their backend + attach logs to keep everything together

  • Scale up message sizes in mpiprofiles

  • Aggregate results into a table, grouped by backend

  • Test fp16 support w/ all backends

  • Ensure all GPUs are being utilized

    • w/ conda/2022-09-08-hvd-nccl, all processes get mapped to GPU0 for some reason (see the sketch after this list)
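
A hypothetical sketch of how each rank could be pinned to its own GPU to avoid this behavior. It assumes the launcher exports a node-local rank via LOCAL_RANK or PMI_LOCAL_RANK; neither the variable names nor the helper are taken from mlprof:

# Hypothetical sketch, not part of mlprof: map each process to the GPU
# matching its node-local rank so that not everything lands on GPU0.
import os

import torch


def pin_to_local_gpu() -> int:
    """Select the CUDA device corresponding to this process's local rank."""
    local_rank = int(
        os.environ.get("LOCAL_RANK", os.environ.get("PMI_LOCAL_RANK", "0"))
    )
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank % torch.cuda.device_count())
    return local_rank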

Setup

Note
These instructions assume that your active environment already has the required ML libraries installed.

Creating a virtual environment with --system-site-packages gives us an isolated, editable installation on top of the existing environment while still allowing access to its previously installed libraries.

To install:

# for ALCF systems, first:
module load conda ; conda activate base
# otherwise, start here:
python3 -m venv venv --system-site-packages
source venv/bin/activate
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install -e .

Running Experiments

We support distributed training using the following backends:

  • DDP
  • DeepSpeed
  • Horovod

which we specify via backend=BACKEND as an argument to the src/mlprof/train.sh script:

cd src/mlprof
./train.sh backend=BACKEND > train.log 2>&1 &

and view the resulting output:

tail -f train.log $(tail -1 logs/latest)

Configuration

Configuration options are specified in src/mlprof/conf/config.yaml and can be overridden on the command line, e.g.:

./train.sh backend=DDP data.batch_size=256 network.hidden_size=64 > train.log 2>&1 &

Running on Polaris

To run interactively on Polaris:

qsub \
  -A <project-name> \
  -q debug-scaling \
  -l select=2 \
  -l walltime=12:00:00,filesystem=eagle:home:grand \
  -I
module load conda/2023-01-10-unstable
conda activate base
git clone https://www.github.com/argonne-lcf/mlprof
cd mlprof
mkdir -p venvs/polaris/2023-01-10
python3 -m venv venvs/polaris/2023-01-10 --system-site-packages
source venvs/polaris/2023-01-10/bin/activate
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install -e .
cd src/mlprof
# TO TRAIN:
./train.sh backend=deepspeed > train.log 2>&1 &
# TO VIEW OUTPUT:
tail -f train.log $(tail -1 logs/latest)

Warning
Running with DeepSpeed

If you're using DeepSpeed directly to launch multi-node training, you will need to ensure that the following environment variables are defined in your .deepspeed_env file.

The contents of this file should be one environment variable per line, formatted as KEY=VALUE.
Each of these environment variables will be explicitly set by DeepSpeed on every worker node.

# -------------------------------------------------------------
# the following are necessary when using the DeepSpeed backend
export CFLAGS="-I${CONDA_PREFIX}/include/"
export LDFLAGS="-L${CONDA_PREFIX}/lib/" 
echo "PATH=${PATH}" > .deepspeed_env 
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> .deepspeed_env
echo "https_proxy=${https_proxy}" >> .deepspeed_env
echo "http_proxy=${http_proxy}" >> .deepspeed_env 
echo "CFLAGS=${CFLAGS}" >> .deepspeed_env
echo "LDFLAGS=${LDFLAGS}" >> .deepspeed_env
# -------------------------------------------------------------
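
After running the commands above, .deepspeed_env should contain one KEY=VALUE pair per line; with placeholder values (yours will differ), it might look like:

PATH=/path/to/conda/bin:/usr/local/bin:/usr/bin
LD_LIBRARY_PATH=/path/to/conda/lib
https_proxy=http://<proxy-host>:<port>
http_proxy=http://<proxy-host>:<port>
CFLAGS=-I/path/to/conda/include/
LDFLAGS=-L/path/to/conda/lib/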

Profiling

To run an experiment with mpitrace enabled on Polaris, explicitly set the LD_PRELOAD environment variable, e.g.

LD_PRELOAD=/soft/perftools/mpitrace/lib/libmpitrace.so ./train.sh > train.log 2>&1 &

which will write MPI profiling information to an mpi_profile.XXXXXX.Y file containing the following:

MPI Profile Results

Data for MPI rank 0 of 8:
Times from MPI_Init() to MPI_Finalize().
-----------------------------------------------------------------------
MPI Routine                        #calls     avg. bytes      time(sec)
-----------------------------------------------------------------------
MPI_Comm_rank                           3            0.0          0.000
MPI_Comm_size                           1            0.0          0.000
MPI_Bcast                               2           16.5          0.000
-----------------------------------------------------------------------
total communication time = 0.000 seconds.
total elapsed time       = 232.130 seconds.
user cpu time            = 122.013 seconds.
system time              = 96.950 seconds.
max resident set size    = 4064.422 MiB.

-----------------------------------------------------------------
Message size distributions:

MPI_Bcast                 #calls    avg. bytes      time(sec)
                               1           4.0          0.000
                               1          29.0          0.000

-----------------------------------------------------------------

Summary for all tasks:

  Rank 0 reported the largest memory utilization : 4064.42 MiB
  Rank 0 reported the largest elapsed time : 232.13 sec

  minimum communication time = 0.000 sec for task 6
  median  communication time = 0.000 sec for task 5
  maximum communication time = 0.000 sec for task 4


MPI timing summary for all ranks:
taskid             host    cpu    comm(s)  elapsed(s)     user(s)   system(s)   size(MiB)    switches
     0   x3210c0s37b1n0      0       0.00      232.13      122.01       96.95     4064.42   240460957
     1   x3210c0s37b1n0      1       0.00      227.60      126.06       95.88     4001.15   231353798
     2   x3210c0s37b1n0      2       0.00      227.63      135.59       85.93     3965.89   230507191
     3   x3210c0s37b1n0      3       0.00      227.63      126.33       95.75     4003.07   230342296
     4    x3210c0s7b0n0      0       0.00      227.66      137.07       83.80     4039.70   209534784
     5    x3210c0s7b0n0      1       0.00      227.64      125.65       96.13     4004.05   230622703
     6    x3210c0s7b0n0      2       0.00      227.64      134.53       87.16     3968.59   229010244
     7    x3210c0s7b0n0      3       0.00      227.67      125.24       96.90     4004.26   233186459
