<img src="./images/DLI_Header.png" style="width: 400px;">


# 6.0 Mixture of Experts (MoE)

In this notebook, we will learn about Mixture of Experts model training.

## The goals

The goals of this notebook are:
* Learn how to incorporate linear experts into a simple convolutional network
* Learn how to train the new Mixture of Experts CNN for classification


### Cancel Previous Running/Pending Jobs

Before moving on, check that no jobs are still running or pending in the SLURM queue by executing the following cell:



In [1]:
# Check the SLURM jobs queue 
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


If there are still jobs running or pending, execute the following cell to cancel all of your jobs using the `scancel` command.

In [2]:
# Cancel all jobs belonging to the current user
!scancel -u $USER

# Check the SLURM jobs queue again (it should be empty, or the ST (status) column should show CG, i.e. completing)
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


---
# 6.1 Mixture of Experts Introduction

A Mixture of Experts (MoE) is a neural network in which some layers are partitioned into sub-networks, called experts, that are activated or not depending on the input. A gating (routing) function decides which experts process each input.
This structure allows the network to learn a wider range of behaviors. The other advantage is that an MoE model requires less computation per input than a dense model with the same number of parameters, because only a few experts are active for any given input.

<img src="images/MOE.png" width="450" />
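The routing idea can be illustrated with a minimal, framework-free sketch of top-k gating. It is only an illustration under stated assumptions: the helper names (`make_linear`, `moe_forward`), the random weights, and the choice to scale each selected expert's output by its gate probability are hypothetical and do not reproduce the DeepSpeed implementation used later in this notebook.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_linear(n_out, n_in, rng):
    """Create a (weights, bias) pair with random weights (illustrative only)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias

def linear(weights, bias, x):
    """Compute W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def moe_forward(x, gate, experts, k=1):
    """Route x to the k experts with the highest gate probability."""
    probs = softmax(linear(*gate, x))
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    out = [0.0] * len(x)
    for i in top:
        # Only the selected experts are evaluated; the rest cost nothing.
        y = linear(*experts[i], x)
        out = [o + probs[i] * yi for o, yi in zip(out, y)]
    return out, top

rng = random.Random(0)
hidden, num_experts = 4, 8  # assumed sizes for this sketch
gate = make_linear(num_experts, hidden, rng)
experts = [make_linear(hidden, hidden, rng) for _ in range(num_experts)]

out, chosen = moe_forward([0.5, -1.0, 0.25, 2.0], gate, experts, k=1)
print("selected expert(s):", chosen)
print("output:", out)
```

With `k=1`, only one of the eight expert layers is evaluated per input, which is where the computational saving comes from.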

In the recent literature, several models have been developed following the MoE structure, such as the [Switch Transformer](https://arxiv.org/pdf/2101.03961.pdf).
 

# 6.2 Write the Mixture of Experts for the baseline CNN

Let's return to our CIFAR-10 CNN classifier and modify it to add one MoE layer. The convolutional layers of the CNN extract features, while the later fully connected layers are specialized for the CIFAR-10 classification problem. 
To add expert layers to the network definition, use `deepspeed.moe.layer.MoE` as follows (and modify the forward pass accordingly):

```
deepspeed.moe.layer.MoE( hidden_size=<Hidden dimension of the model>, 
                         expert=<Torch module that defines the expert>, 
                         num_experts=<Desired number of experts>, 
                         ep_size=<Desired expert-parallel world size>,
                         ...
                         )
```

Learn more about the DeepSpeed Mixture of Experts in the [dedicated DeepSpeed documentation](https://deepspeed.readthedocs.io/en/latest/moe.html).

Let's convert the last fully connected layer, `fc3`, into an MoE layer in order to evaluate the features extracted by the earlier layers. We will then add a final classifier, `fc4`.
We have already prepared the [cifar10_deepspeed_MOE.py](./code/moe/cifar10_deepspeed_MOE.py) script. Let's run it using 8 experts partitioned across 4 GPUs, which means that each GPU will handle 2 experts.
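The arithmetic behind "8 experts on 4 GPUs" can be sketched as follows. The helper name `experts_per_rank` and the contiguous assignment of expert indices to ranks are assumptions made for illustration, not a description of how DeepSpeed places experts internally; only the per-GPU count follows from the numbers above. The sketch also assumes that the number of experts divides evenly by the expert-parallel world size.

```python
def experts_per_rank(num_experts, ep_size):
    """Return a hypothetical {rank: [expert ids]} placement.

    Assumes experts are split evenly and contiguously across ranks.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size in this sketch")
    local = num_experts // ep_size  # experts hosted by each GPU
    return {rank: list(range(rank * local, (rank + 1) * local))
            for rank in range(ep_size)}

# Values matching --num-experts-per-layer 8 and --ep-world-size 4
placement = experts_per_rank(num_experts=8, ep_size=4)
for rank, ids in placement.items():
    print(f"GPU {rank}: experts {ids}")
```

Each of the 4 ranks ends up with 8 / 4 = 2 experts, which matches the configuration passed to the launch command in the next cell.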

In [3]:
!deepspeed --num_gpus=4 /dli/code/moe/cifar10_deepspeed_MOE.py  \
    --deepspeed \
    --deepspeed_config /dli/code/moe/ds_config.json \
    --moe \
    --ep-world-size 4 \
    --num-experts-per-layer 8 \
    --top-k 1 \
    --noisy-gate-policy 'RSample' \
    --moe-param-group \
    --profile-execution=True \
    --profile-name='zero0_MOE'

[2023-06-27 15:02:53,744] [INFO] [runner.py:457:main] cmd = /opt/conda/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 /dli/code/moe/cifar10_deepspeed_MOE.py --deepspeed --deepspeed_config /dli/code/moe/ds_config.json --moe --ep-world-size 4 --num-experts-per-layer 8 --top-k 1 --noisy-gate-policy RSample --moe-param-group --profile-execution=True --profile-name=zero0_MOE
[2023-06-27 15:02:54,772] [INFO] [launch.py:96:main] 0 NCCL_P2P_DISABLE=1
[2023-06-27 15:02:54,773] [INFO] [launch.py:96:main] 0 NCCL_VERSION=2.11.4
[2023-06-27 15:02:54,773] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-27 15:02:54,773] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-27 15:02:54,773] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-27 15:02:54,773] [INFO] [launch.py:123:main] dist_w

<img src="images/deepspeed_MOE.png" width="950" />

---
<h2 style="color:green;">Congratulations!</h2>

The next lab will focus on deploying large neural networks.

Before moving on, we need to make sure no jobs are still running or waiting in the queue. 

In [4]:
# Check the SLURM jobs queue 
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


If there are still jobs running or pending, execute the following cell to cancel all of your jobs using the `scancel` command.

In [5]:
# Cancel all jobs belonging to the current user
!scancel -u $USER

# Check the SLURM jobs queue again (it should be empty, or the ST (status) column should show CG, i.e. completing)
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
