### Computer Vision using DeepSpeed and MPI

In this advanced tutorial, you will learn more about ``deepspeed`` and the Zero Redundancy Optimizer (ZeRO) algorithms.  These algorithms are a large part of what lets ``deepspeed`` train models with a billion or more parameters on small clusters of GPUs.  Let's get started!

#### What the learner will be able to do at the end of this tutorial:

- Execute a simple computer vision pipeline
- Explain how to modify code to run in the DeepSpeed framework
- Understand the subtle difference between running within the DeepSpeed framework versus using MPI to run DeepSpeed pipelines
- Assess whether a pipeline would run more efficiently in plain Python or in the DeepSpeed framework
- Contrast ZeRO optimizers in experiments by changing the configuration file

#### What is DeepSpeed?

DeepSpeed was originally released in early 2020 as open-source software from Microsoft Research.  It is a framework for training and running inference on very large, memory-hungry models.  It is built to run on MPI, which allows DeepSpeed to manage resources fluidly.  See the beginner's MPI notebook for information on how MPI works and shares memory.  The resources DeepSpeed manages include the memory available on GPUs and CPUs, as well as Non-Volatile Memory Express (NVMe) storage.  As a gentle reminder, GPU nodes also include a limited number of CPUs that run some basic operations.

#### What are ZeRO optimization algorithms?

The ZeRO algorithms in DeepSpeed take advantage of all the memory available across the GPUs.  With them, models can scale from sizes that fit on a single GPU to sizes that need a high-performance cluster of GPUs.  Models trained this way can have well over a billion parameters.  Open-source models tested and trained with DeepSpeed include Megatron, GPT-2, and Yandex's YaLM 100B.

#### How do ZeRO optimization algorithms help with training large parameter models or models with large datasets?

As we mentioned, DeepSpeed takes advantage of the ZeRO (Zero Redundancy Optimizer) algorithms.  These include:

- ***ZeRO (stage 1)*** - an algorithm in which the optimizer states (for the Adam optimizer: the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its partition.

- ***ZeRO2 (stage 2)*** - an algorithm in which the reduced 32-bit gradients used for updating the model weights are also partitioned, such that each process retains only the gradients corresponding to its portion of the optimizer states.

- ***ZeRO3 (stage 3)*** - an algorithm which partitions the 16-bit model parameters across the processes. ZeRO-3 automatically collects and partitions them during the forward and backward passes.
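
To make the effect of the three stages concrete, the sketch below estimates the per-GPU memory needed for model states. It assumes the accounting used in the ZeRO paper for mixed-precision Adam training: 2 bytes per parameter for 16-bit weights, 2 bytes for 16-bit gradients, and 12 bytes for the 32-bit optimizer states. The function name ``zero_model_state_bytes`` is hypothetical, and activations, temporary buffers, and fragmentation are ignored, so treat the results as rough lower bounds rather than measurements.

```python
def zero_model_state_bytes(num_params, num_gpus, stage=0):
    """Approximate per-GPU bytes for model states under a ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes/param for 16-bit weights,
    2 bytes/param for 16-bit gradients, and 12 bytes/param for the
    optimizer states (32-bit weights, first and second moments).
    Stage 0 means no partitioning at all.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage >= 1:   # stage 1 partitions the optimizer states
        optim /= num_gpus
    if stage >= 2:   # stage 2 also partitions the gradients
        grads /= num_gpus
    if stage >= 3:   # stage 3 also partitions the 16-bit parameters
        weights /= num_gpus
    return weights + grads + optim


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, a 7.5-billion-parameter model on 64 GPUs drops from about 120 GB of model states per GPU with no partitioning to under 2 GB per GPU at stage 3.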

There are many other components of DeepSpeed and the library is evolving its technologies.  Check out the references at the end of the tutorial and Microsoft's blogs on DeepSpeed.

For more detailed information see the DeepSpeed tutorial: [Zero Redundancy Optimizers](https://www.deepspeed.ai/tutorials/zero/)

#### Getting Started on this Tutorial

In this tutorial we will look at training a computer vision pipeline with DeepSpeed.  We will run the code in more than one way, demonstrating that it can be launched with either the ``deepspeed`` command or the ``mpirun`` command.  The advantage of ``mpirun`` is more granular control of the run.  For example, with ``mpirun`` one can set the number of processes, the ``map-by`` policy (for instance, by slot), and the ``bind-to`` configuration.  Let's dive right in!

As a best practice, let's start by checking that the hosts are listed correctly when the ``hostname`` command is called.  This indicates that the cluster is set up to run correctly.  If the hostnames do not show up, it is best to restart the kernel and then *sync* the MPI workers again.

In [None]:
!mpirun hostname

Now, just for comparison's sake, below is how to run the CIFAR example code without MPI or DeepSpeed.  Do not run this in class, as it will take a long time to finish.  You can run it on your own after the workshop to compare the methods.

In [None]:
%%time

# run this at home or when time is not an issue -- it will take much longer than running with MPI or deepspeed

#!python /mnt/data/DeepSpeed_Repository_Code/DeepSpeedExamples-master/cifar/cifar10_tutorial.py

#### Running our Pipeline on MPI without DeepSpeed

Now let's see what happens when we run the script with the ``mpirun`` command, without the use of DeepSpeed.  How long does it take to run?  Why do you think it takes that amount of time?  Hint: when we run code with ``mpirun`` alone, we're not using memory optimization techniques.  That means more data is held in memory and none of it has been offloaded to the CPU.

In [None]:
%%time

!mpirun -np 2 python cifar10_tutorial.py
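
When a script is launched with plain ``mpirun``, every process runs the same file from top to bottom, and nothing divides the work unless the script does so itself. As a minimal sketch, assuming the Open MPI launcher, a script can find out which copy it is by reading the environment variables that Open MPI sets for each process. The helper name ``mpi_process_info`` is hypothetical.

```python
import os


def mpi_process_info(env=None):
    """Return (rank, world_size, local_rank) from Open MPI's environment.

    Falls back to a single-process view when the variables are absent,
    for example when the script is started with plain `python`.
    """
    env = os.environ if env is None else env
    rank = int(env.get("OMPI_COMM_WORLD_RANK", 0))
    world_size = int(env.get("OMPI_COMM_WORLD_SIZE", 1))
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
    return rank, world_size, local_rank


if __name__ == "__main__":
    rank, world_size, local_rank = mpi_process_info()
    print(f"process {rank} of {world_size} (local rank {local_rank})")
```

Saved to a file and launched with ``mpirun -np 2 python``, each of the two processes would report a different rank, which is a quick way to confirm that both copies are running.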

#### Using Code Initialized for DeepSpeed

Now let's experiment with ``deepspeed``.  Take a look at the file called ``cifar10_deepspeed.py``.  Notice that the following code has been added to the file.  Looking at the ``deepspeed`` repository, we discover that to set up ``deepspeed`` on a cluster we need to initialize with a ``deepspeed`` command instead of a ``torch.distributed`` command.

Take a look at this code snippet.

```
# Initialize DeepSpeed to use the following features
# 1) Distributed model
# 2) Distributed data loader
# 3) DeepSpeed optimizer
model_engine, optimizer, trainloader, __ = deepspeed.initialize(
    args=args, model=net, model_parameters=parameters, training_data=trainset)

fp16 = model_engine.fp16_enabled()
print(f'fp16={fp16}')
```
In the examples below we try the following and see how the time compares:

- Training and inference using the ``deepspeed`` command and a ZeRO optimizer.

- Training and inference using the ``deepspeed`` command and a ZeRO2 optimizer.

- Training and inference using the ``mpirun`` command and a ZeRO optimizer.

What do you think the difference is between using the ``deepspeed`` command and the ``mpirun`` command?  Hint: DeepSpeed is a framework built on top of MPI.  Do you think it's quicker to use MPI without invoking the ``deepspeed`` launcher?

#### What does it take to run distributed compute using DeepSpeed?

There are a couple of minor code changes required when using PyTorch in order to run in the ``deepspeed`` framework or on an MPI / ``deepspeed`` pipeline.

Here is the code required to run on ``deepspeed``:

```
model_engine, optimizer, trainloader, __ = deepspeed.initialize(
    args=args, model=net, model_parameters=parameters, training_data=trainset)

fp16 = model_engine.fp16_enabled()
print(f'fp16={fp16}')

```
Of particular note is the function ``deepspeed.initialize()``, which sets the code up to run in the ``deepspeed`` framework.  It ***does not*** indicate which optimizer to use.  That information is in the configuration file.

There may be other modifications required to your file, but it's best to start very simply with the modifications above, which we used for this computer vision example.  If you have more questions or would like specialized help with your code, please reach out to the FDS team at Domino Data Lab.
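
Once ``deepspeed.initialize()`` has returned, the training loop calls the engine rather than the raw model and optimizer. The sketch below shows the general shape of such a loop. It assumes the ``model_engine`` and ``trainloader`` returned by the call above and a loss function such as ``torch.nn.CrossEntropyLoss()``. The helper name ``train_one_epoch`` is hypothetical and is not taken from ``cifar10_deepspeed.py``.

```python
def train_one_epoch(model_engine, trainloader, criterion):
    """Run one pass over the data with a DeepSpeed engine.

    Returns the mean loss over the batches seen by this process.
    """
    total_loss, num_batches = 0.0, 0
    for inputs, labels in trainloader:
        # Move the batch to the GPU assigned to this process.
        inputs = inputs.to(model_engine.local_rank)
        labels = labels.to(model_engine.local_rank)

        outputs = model_engine(inputs)      # forward pass through the engine
        loss = criterion(outputs, labels)

        model_engine.backward(loss)         # replaces loss.backward()
        model_engine.step()                 # replaces optimizer.step()

        total_loss += loss.item()
        num_batches += 1
    return total_loss / max(num_batches, 1)
```

If ``fp16`` is enabled in the configuration file, the inputs may also need to be converted to half precision before the forward pass; check how ``cifar10_deepspeed.py`` handles this before reusing the sketch.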

In [None]:
%%time

## running without a host file, with the Zero optimizer

!deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config_simple.json

Now let's take a look at what happens when we use the hostfile.  Remember that, in order to recognize the GPUs in the cluster, ``deepspeed`` or ``mpirun`` needs a hostfile listing the host names and addresses.  How does the time change?  How does the output change?

In [None]:
%%time

## run with the mpi hostfiles and stage zero optimization, in this example the pipeline will run on every mpi 'host'

!deepspeed --hostfile /domino/mpi/hosts cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config2.json
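
Before launching across hosts, it can help to check how many processes the hostfile allows. The sketch below parses hostfile text, assuming the ``hostname slots=N`` syntax that both Open MPI and the ``deepspeed`` launcher accept; the actual contents of ``/domino/mpi/hosts`` are not shown in this tutorial, so the sample text and the helper name ``parse_hostfile`` are hypothetical.

```python
def parse_hostfile(text):
    """Parse 'hostname slots=N' lines into an ordered {hostname: slots} dict.

    Blank lines and '#' comments are skipped. A host listed without
    'slots=' is counted as one slot, which is an assumption of this
    sketch and not necessarily what a given launcher does.
    """
    hosts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = line.split()
        slots = 1
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        hosts[fields[0]] = slots
    return hosts


if __name__ == "__main__":
    sample = "worker-0 slots=1\nworker-1 slots=1\n"   # hypothetical contents
    hosts = parse_hostfile(sample)
    print(hosts, "total slots:", sum(hosts.values()))
```

The total slot count gives an upper bound to compare against the number of processes requested on the command line.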

#### Running a DeepSpeed file using MPI

As long as ***mpi4py*** is installed along with ***deepspeed***, the ``mpirun`` command can be used, just as we did for the calculation-of-pi examples.  Try running the code below 'as is' and then try adding a few of the options.  Here are some things to think about.

- Change the number of processes.  How does this affect the wall clock time?  Do fewer processes speed up the code?  Hint: think about parallelism.

- Try using the ``mpirun`` options ``map-by`` and ``bind-to``.  What effect, if any, does this have on how the code runs?

- Why do you think the Open MPI command ``mpirun`` makes DeepSpeed run faster than using the ``deepspeed`` command only?

- What do you think happens if we used DeepSpeed ZeRO2 memory optimizer?

In [None]:
%%time

!mpirun -np 2 --hostfile /domino/mpi/hosts python cifar10_deepspeed.py \
--deepspeed_config ds_config_zero2.json
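
To try the variations suggested above without retyping the command, the launch line can be assembled in Python. This is a minimal sketch: ``-np``, ``--hostfile``, ``--map-by``, and ``--bind-to`` are Open MPI ``mpirun`` options, while the helper name ``build_mpirun_command`` and the example values ``slot`` and ``core`` are illustrative choices, not settings taken from this workshop.

```python
import shlex


def build_mpirun_command(script, config, num_procs=2, hostfile=None,
                         map_by=None, bind_to=None):
    """Assemble an mpirun command line as a list of arguments."""
    cmd = ["mpirun", "-np", str(num_procs)]
    if hostfile:
        cmd += ["--hostfile", hostfile]
    if map_by:
        cmd += ["--map-by", map_by]      # e.g. "slot" or "node"
    if bind_to:
        cmd += ["--bind-to", bind_to]    # e.g. "core" or "none"
    cmd += ["python", script, "--deepspeed_config", config]
    return cmd


if __name__ == "__main__":
    cmd = build_mpirun_command(
        "cifar10_deepspeed.py", "ds_config_zero2.json",
        num_procs=2, hostfile="/domino/mpi/hosts",
        map_by="slot", bind_to="core")
    print(shlex.join(cmd))
```

The printed string can be pasted into a notebook cell after ``!``, which makes it easy to time several combinations side by side.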

#### Changing the DeepSpeed Zero Optimizer

Now let's take this simple example and see what happens when we use ZeRO stage 2 instead of stage 1 (as in the original configuration file).  Does the code run any differently?  Do you think the difference would be more impactful for a bigger model with more data?

Configuration changes:

```
{
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8
    }
}
```

Look at the JSON file called ``ds_config_zero2.json``.  See if you can find the configuration change.  Now try running the MPI code again with the ZeRO3 optimizer.  Hint: Why does it take slightly longer to run when more memory is offloaded?  Under what circumstances might this be helpful?

In [None]:
%%time

!mpirun -np 2 --hostfile /domino/mpi/hosts python cifar10_deepspeed.py \
--deepspeed_config ds_config_zero2.json
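
To contrast the stages in a repeatable way, variants of a configuration can be generated programmatically instead of edited by hand. The sketch below changes only the ``stage`` key of a configuration dictionary; the helper name ``with_zero_stage`` is hypothetical, and the other ``zero_optimization`` options may need separate tuning for each stage.

```python
import copy
import json


def with_zero_stage(config, stage):
    """Return a copy of a DeepSpeed config dict with the ZeRO stage replaced."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    new_config = copy.deepcopy(config)   # leave the caller's dict untouched
    new_config.setdefault("zero_optimization", {})["stage"] = stage
    return new_config


if __name__ == "__main__":
    base = {
        "zero_optimization": {
            "stage": 2,
            "contiguous_gradients": True,
            "overlap_comm": True,
        }
    }
    for stage in (1, 2, 3):
        print(json.dumps(with_zero_stage(base, stage)))
```

In practice the base dictionary would be loaded from an existing file with ``json.load`` and each variant written to its own file with ``json.dump``, so that every run points at a distinct ``--deepspeed_config`` path.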

#### What we learned in this tutorial

- DeepSpeed runs on MPI (under the hood).

- DeepSpeed scripts can be launched with ``mpirun``, or one can use the ``deepspeed`` command.

- Optimizing memory is as simple as changing one or two lines of a configuration file.

- The code changes required for a ``pytorch`` file to run with ``deepspeed`` in distributed mode are simple.

- DeepSpeed is appropriate for large parameter models because it handles memory well.

- DeepSpeed / MPI clusters do not have a user interface, but the output is verbose.


#### More resources on DeepSpeed and MPI

There are some excellent references and tutorials on DeepSpeed.  Try some of the examples on your own, or try a large model like GPT-2 and see how it performs.  DeepSpeed enables the use of less hardware with greater memory efficiency.  Despite the name, however, DeepSpeed may not speed up runtime for smaller models or smaller datasets.  It is extremely powerful for training very large models, in particular large language or transformer models like GPT-2, Megatron, or Yandex's YaLM 100B.  Read some of the references below to learn more.
    
- [Basics of MPI](https://www.markdownguide.org/basic-syntax/)

- [OpenMPI Documentation](https://www.open-mpi.org/faq/?category=running#simple-launch)

- [DeepSpeed Documentation](https://www.deepspeed.ai/)

- [DeepSpeed Repository with more Practice Exercises](https://github.com/microsoft/DeepSpeedExamples)

- [Training a very large model with hugging face -- try some of these at home!](https://huggingface.co/docs/transformers/main_classes/deepspeed)

- [Train Megatron using DeepSpeed](https://www.deepspeed.ai/tutorials/megatron/)