
mlprof

Profiling tools for performance studies of competing ML frameworks on HPC systems

TODO

06/05/2023

  • Add a check to determine whether we are running on Intel GPUs; if so, load intel_extension_for_{pytorch,deepspeed} (see the sketch after this list)
    • Modify the implementation to add support for Intel GPUs and test on ALCF systems
  • Add support for additional (transformer based) model architectures in src/mlprof/network/pytorch/*
    • ideally, support for pulling in arbitrary models from HuggingFace, torchvision, etc.
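
A minimal sketch of what such a check might look like (hypothetical, not yet implemented in mlprof; it assumes intel_extension_for_pytorch exposes Intel GPUs to PyTorch through the xpu device type):

# Hypothetical sketch, not part of mlprof: detect Intel GPUs and fall back
# to CUDA / CPU otherwise.
import torch


def running_on_intel_gpu() -> bool:
    """Return True if an Intel (XPU) device is visible to PyTorch."""
    try:
        # Importing IPEX registers the 'xpu' device type with torch
        import intel_extension_for_pytorch  # noqa: F401
    except ImportError:
        return False
    return hasattr(torch, "xpu") and torch.xpu.is_available()


DEVICE = (
    "xpu" if running_on_intel_gpu()
    else ("cuda" if torch.cuda.is_available() else "cpu")
)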

04/17/2023

  • Repeat the MPI profiling experiments with larger batch size / network size using module load conda/2023-01-10-unstable on Polaris
  • Try with single and multiple nodes to measure the performance impact

Older

  • Write DeepSpeed Trainer that wraps src/mlprof/network/pytorch/network.py
  • MPI profiling to capture all collective communication ops with the same model in DeepSpeed, DDP, and Horovod
    • Reference: Profiling using libmpitrace.so on Polaris
  • Start with 2 nodes, then scale up with an increasing number of nodes
  • Get profiles for DeepSpeed ZeRO 1, 2, 3 and Mixture of Experts (MoE)
  • Identify which settings impact performance, e.g. NCCL environment variables and framework-specific parameters (see the sketch after this list)
  • Perform the analysis for both standard models and large language models (LLMs)
  • Develop auto-tuning methods to set these parameters for optimal performance
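
As an illustration of the kind of settings such a study might sweep, the sketch below sets a few standard NCCL environment variables. The variable names are real NCCL settings, but the values are illustrative rather than tuned recommendations, and the helper is hypothetical (not part of mlprof):

# Hypothetical sketch, not part of mlprof: example NCCL environment variables
# that can affect collective-communication performance.
import os

NCCL_SETTINGS = {
    "NCCL_DEBUG": "INFO",          # log NCCL's algorithm / transport choices
    "NCCL_NET_GDR_LEVEL": "PHB",   # how aggressively to use GPUDirect RDMA
    "NCCL_SOCKET_IFNAME": "hsn",   # illustrative: restrict NCCL to one interface
}


def apply_nccl_settings(settings):
    """Set each variable only if it is not already defined by the user."""
    for key, value in settings.items():
        os.environ.setdefault(key, value)


apply_nccl_settings(NCCL_SETTINGS)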

2023-02-20

  • Associate mpiprofiles with their backend + attach logs to keep everything together

  • Scale up message sizes in mpiprofiles

  • Aggregate results into a table, grouped by backend

  • Test fp16 support w/ all backends

  • Ensure all GPUs are being utilized

    • w/ conda/2022-09-08-hvd-nccl, all processes get mapped to GPU0 for some reason (see the sketch after this list)
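
A hypothetical sketch of how each rank could be pinned to its own GPU to avoid this behavior. It assumes the launcher exports a node-local rank via LOCAL_RANK or PMI_LOCAL_RANK; neither the variable names nor the helper are taken from mlprof:

# Hypothetical sketch, not part of mlprof: map each process to the GPU
# matching its node-local rank so that not everything lands on GPU0.
import os

import torch


def pin_to_local_gpu() -> int:
    """Select the CUDA device corresponding to this process's local rank."""
    local_rank = int(
        os.environ.get("LOCAL_RANK", os.environ.get("PMI_LOCAL_RANK", "0"))
    )
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank % torch.cuda.device_count())
    return local_rank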

Setup

Note
These instructions assume that your active environment already has the required ML libraries installed.

Creating a virtual environment with --system-site-packages gives us an isolated, editable installation on top of the existing environment while still allowing access to its previously installed libraries.

To install:

# for ALCF systems, first:
module load conda ; conda activate base
# otherwise, start here:
python3 -m venv venv --system-site-packages
source venv/bin/activate
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install -e .

Running Experiments

We support distributed training using the following backends:

  • DDP
  • DeepSpeed
  • Horovod

which we specify via backend=BACKEND as an argument to the src/mlprof/train.sh script:

cd src/mlprof
./train.sh backend=BACKEND > train.log 2>&1 &

and view the resulting output:

tail -f train.log $(tail -1 logs/latest)

Configuration

Configuration options are specified in src/mlprof/conf/config.yaml and can be overridden on the command line, e.g.:

./train.sh backend=DDP data.batch_size=256 network.hidden_size=64 > train.log 2>&1 &

Running on Polaris

To run interactively on Polaris:

qsub \
  -A <project-name> \
  -q debug-scaling \
  -l select=2 \
  -l walltime=12:00:00,filesystem=eagle:home:grand \
  -I
module load conda/2023-01-10-unstable
conda activate base
git clone https://www.github.com/argonne-lcf/mlprof
cd mlprof
mkdir -p venvs/polaris/2023-01-10
python3 -m venv venvs/polaris/2023-01-10 --system-site-packages
source venvs/polaris/2023-01-10/bin/activate
python3 -m pip install --upgrade pip setuptools wheel
python3 -m pip install -e .
cd src/mlprof
# TO TRAIN:
./train.sh backend=deepspeed > train.log 2>&1 &
# TO VIEW OUTPUT:
tail -f train.log $(tail -1 logs/latest)

Warning
Running with DeepSpeed

If you're using DeepSpeed directly to launch multi-node training, you will need to ensure that the following environment variables are defined in your .deepspeed_env file.

The contents of this file should be one environment variable per line, formatted as KEY=VALUE.
Each of these environment variables will be explicitly set by DeepSpeed on every worker node.

# -------------------------------------------------------------
# the following are necessary when using the DeepSpeed backend
export CFLAGS="-I${CONDA_PREFIX}/include/"
export LDFLAGS="-L${CONDA_PREFIX}/lib/" 
echo "PATH=${PATH}" > .deepspeed_env 
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> .deepspeed_env
echo "https_proxy=${https_proxy}" >> .deepspeed_env
echo "http_proxy=${http_proxy}" >> .deepspeed_env 
echo "CFLAGS=${CFLAGS}" >> .deepspeed_env
echo "LDFLAGS=${LDFLAGS}" >> .deepspeed_env
# -------------------------------------------------------------
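
After running the commands above, .deepspeed_env should contain one KEY=VALUE pair per line; with placeholder values (yours will differ), it might look like:

PATH=/path/to/conda/bin:/usr/local/bin:/usr/bin
LD_LIBRARY_PATH=/path/to/conda/lib
https_proxy=http://<proxy-host>:<port>
http_proxy=http://<proxy-host>:<port>
CFLAGS=-I/path/to/conda/include/
LDFLAGS=-L/path/to/conda/lib/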

Profiling

To run an experiment with mpitrace enabled on Polaris, explicitly set the LD_PRELOAD environment variable, e.g.

LD_PRELOAD=/soft/perftools/mpitrace/lib/libmpitrace.so ./train.sh > train.log 2>&1 &

which will write MPI profiling information to an mpi_profile.XXXXXX.Y file containing the following:

MPI Profile Results

Data for MPI rank 0 of 8:
Times from MPI_Init() to MPI_Finalize().
-----------------------------------------------------------------------
MPI Routine                        #calls     avg. bytes      time(sec)
-----------------------------------------------------------------------
MPI_Comm_rank                           3            0.0          0.000
MPI_Comm_size                           1            0.0          0.000
MPI_Bcast                               2           16.5          0.000
-----------------------------------------------------------------------
total communication time = 0.000 seconds.
total elapsed time       = 232.130 seconds.
user cpu time            = 122.013 seconds.
system time              = 96.950 seconds.
max resident set size    = 4064.422 MiB.

-----------------------------------------------------------------
Message size distributions:

MPI_Bcast                 #calls    avg. bytes      time(sec)
                               1           4.0          0.000
                               1          29.0          0.000

-----------------------------------------------------------------

Summary for all tasks:

  Rank 0 reported the largest memory utilization : 4064.42 MiB
  Rank 0 reported the largest elapsed time : 232.13 sec

  minimum communication time = 0.000 sec for task 6
  median  communication time = 0.000 sec for task 5
  maximum communication time = 0.000 sec for task 4


MPI timing summary for all ranks:
taskid             host    cpu    comm(s)  elapsed(s)     user(s)   system(s)   size(MiB)    switches
     0   x3210c0s37b1n0      0       0.00      232.13      122.01       96.95     4064.42   240460957
     1   x3210c0s37b1n0      1       0.00      227.60      126.06       95.88     4001.15   231353798
     2   x3210c0s37b1n0      2       0.00      227.63      135.59       85.93     3965.89   230507191
     3   x3210c0s37b1n0      3       0.00      227.63      126.33       95.75     4003.07   230342296
     4    x3210c0s7b0n0      0       0.00      227.66      137.07       83.80     4039.70   209534784
     5    x3210c0s7b0n0      1       0.00      227.64      125.65       96.13     4004.05   230622703
     6    x3210c0s7b0n0      2       0.00      227.64      134.53       87.16     3968.59   229010244
     7    x3210c0s7b0n0      3       0.00      227.67      125.24       96.90     4004.26   233186459
