<img src="./images/DLI_Header.png" style="width: 400px;">

# Assessment

## Overview

The goal of this assessment is to evaluate your ability to build and execute large models. Please demonstrate that ability by porting an existing piece of code to DeepSpeed and creating a series of configuration files that enable a range of DeepSpeed features, including activation checkpointing, mixed precision training, and the ZeRO (Zero Redundancy Optimizer) optimizer.

To keep the task manageable, we have deliberately selected a simplified codebase, namely minGPT (https://github.com/karpathy/minGPT). This is a minimal Transformer implementation that will not provide maximum performance, but it is representative and should allow you to complete this coding exercise in a relatively short period of time.

In this task, we will look at yet another family of models, namely Vision Transformers. Before diving into the assignment, please review the [code example](minGPT/minGPT/play_image.ipynb) that we will be using in this assessment. Feel free to execute the code example, but bear in mind that training to convergence will take a considerable amount of time, so it might help to stop it early and focus on the code migration discussed below.

## Introduction

Conceptually, our goal will be to:
- Migrate a standalone PyTorch implementation of the training pipeline to DeepSpeed and train effectively on our "two server" cluster
- Enable functionality that saves memory, namely Mixed Precision Training, Activation Checkpointing and the ZeRO (Zero Redundancy Optimizer) optimizer
- Increase the size of the model being trained

This notebook will guide you through the process and provides test code to help you determine whether you are on the right path. Once the code is complete, you will be asked to go back to the lab platform and press the `assess` button. This triggers an automated process that loads your code files and DeepSpeed configuration files, executes them, and assesses the correctness of the implementation. Please leave enough time for this step, as it can take several minutes to complete. If you are running out of time, please download the files you have modified so that you can finish them later.

### Hints
* There are many different files in this assessment. To help keep track of what needs to be updated, we've placed `FIXME`s in relevant locations. If running a file results in an error, please look for a `FIXME`.
* We will be processing a lot of data in this assessment. Please be patient with the hardware, and wait a minute between cancelling and running jobs.
* If a clean slate is needed, please restart the server. Before restarting, download the following files, then upload them afterwards to resume your work:
    * [Assessment.ipynb](./Assessment.ipynb)
    * [runFirstDeepSpeed.py](./minGPT/minGPT/runFirstDeepSpeed.py)
    * [trainer.py](./minGPT/minGPT/mingpt/trainer.py)
    * [model.py](./minGPT/minGPT/mingpt/model.py)
    * [runStep5.py](./minGPT/minGPT/runStep5.py)

Good luck!

## Step 1: Baseline implementation

Let us begin by looking at the starting point of our assessment, namely [runStartingPoint.py](./minGPT/minGPT/runStartingPoint.py). This is the same code that was reviewed earlier, extracted into a Python file so that it can be executed as a batch job. Let us test it to make sure it works in standalone mode. Once again, training to convergence will take a substantial amount of time, so once you see training progress, feel free to stop the training process and move to the next step.

In [None]:
!python minGPT/minGPT/runStartingPoint.py

## Step 2: Enabling DeepSpeed

Let's start by adapting the previous training scripts to use the DeepSpeed library with minimal changes to the code. To do so, you will need to:

&nbsp; &nbsp; 1.  Modify the relevant sections in [runFirstDeepSpeed.py](./minGPT/minGPT/runFirstDeepSpeed.py)   
&nbsp; &nbsp; 2.  Modify the relevant sections in [trainer.py](./minGPT/minGPT/mingpt/trainer.py)   
&nbsp; &nbsp; 3.  Create the DeepSpeed configuration file `ds_config_basic.json`   
&nbsp; &nbsp; 4.  Run the training with `deepspeed` command


### 1.  Modify the "ToDo Step 2" sections in the file `runFirstDeepSpeed.py`
Open the file [runFirstDeepSpeed.py](./minGPT/minGPT/runFirstDeepSpeed.py) and complete the "ToDo Step 2" sections to port the code to DeepSpeed. There are 4 sections to complete.
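
As a rough illustration of the argument handling a ported script needs, the sketch below parses the flags that appear in the launch command later in this step. `build_parser` is a hypothetical helper, not part of the assessment files. In an actual DeepSpeed script, `deepspeed.add_config_arguments(parser)` would normally register `--deepspeed` and `--deepspeed_config`; they are added by hand here only so that the snippet runs without DeepSpeed installed.

```python
import argparse


def build_parser():
    """Hypothetical helper: flags a DeepSpeed-launched script typically receives."""
    parser = argparse.ArgumentParser(description="minGPT DeepSpeed sketch")
    # The launcher passes the index of the GPU each process should use on its node.
    parser.add_argument("--local_rank", type=int, default=-1)
    # Stand-ins for the flags that deepspeed.add_config_arguments(parser) registers.
    parser.add_argument("--deepspeed", action="store_true")
    parser.add_argument("--deepspeed_config", type=str, default=None)
    return parser


# Simulate the command line one worker process would see.
args = build_parser().parse_args(
    ["--local_rank=0", "--deepspeed", "--deepspeed_config", "ds_config_basic.json"]
)
print(args.local_rank, args.deepspeed, args.deepspeed_config)
```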

### 2.  Modify the "ToDo Step 2" sections in the `trainer.py`
Open the file [trainer.py](./minGPT/minGPT/mingpt/trainer.py) and implement the `DeepSpeedTrainer` class by defining the "ToDo Step 2" sections. There are 6 sections to be modified/implemented.
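
The core change in a DeepSpeed training loop is that the engine returned by `deepspeed.initialize` takes over the backward pass and the optimizer step. The sketch below shows that pattern with a stand-in `FakeEngine` (a hypothetical class that only records calls) so that it runs without GPUs or DeepSpeed. It is an illustration of the calling convention, not the solution to the `DeepSpeedTrainer` sections.

```python
def train_one_epoch(engine, batches):
    """Run one pass over `batches` with a DeepSpeed-style engine.

    `engine` is assumed to be the first value returned by
    deepspeed.initialize(args=..., model=..., model_parameters=...).
    """
    losses = []
    for x, y in batches:
        logits, loss = engine(x, y)  # minGPT's forward returns (logits, loss)
        engine.backward(loss)        # used instead of loss.backward()
        engine.step()                # used instead of optimizer.step()
        losses.append(float(loss))
    return losses


class FakeEngine:
    """Hypothetical stand-in that records which engine methods were called."""

    def __init__(self):
        self.calls = []

    def __call__(self, x, y):
        self.calls.append("forward")
        return None, abs(x - y)

    def backward(self, loss):
        self.calls.append("backward")

    def step(self):
        self.calls.append("step")


fake = FakeEngine()
print(train_one_epoch(fake, [(1.0, 0.5), (2.0, 2.0)]))
print(fake.calls)
```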

### 3.  Create the DeepSpeed configuration file `ds_config_basic.json`
In the next cell, replace each `FIXME` so that the configuration:
- Sets the micro-batch size per GPU to 8
- Enables the Adam optimizer with the learning rate used in the original code [runStartingPoint.py](./minGPT/minGPT/runStartingPoint.py)
- Sets gradient clipping to the value used in the original code [runStartingPoint.py](./minGPT/minGPT/runStartingPoint.py)
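
After the cell below has been filled in and run, a quick sanity check can catch two common mistakes: a leftover `FIXME` placeholder and a file that is not valid JSON. The helper below is a minimal sketch; `check_ds_config` is a hypothetical name, and the sample values are illustrative only, not the answers to this step.

```python
import json


def check_ds_config(text):
    """Return the parsed config, or raise ValueError if it is not ready to use."""
    if "FIXME" in text:
        raise ValueError("config still contains a FIXME placeholder")
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"config is not valid JSON: {err}") from err


# Illustrative values only; they are not the answers to this step.
sample = '{"train_micro_batch_size_per_gpu": 2, "gradient_clipping": 0.5}'
print(check_ds_config(sample))
```

To check the real file, pass its contents instead, for example `check_ds_config(open("minGPT/minGPT/ds_config_basic.json").read())`.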

In [None]:
%%writefile ./minGPT/minGPT/ds_config_basic.json
{
  "train_micro_batch_size_per_gpu": #FIXME,
  "optimizer": {
    "type": #FIXME,
    "params": {
      "lr": #FIXME
    }
  },
  "gradient_clipping": #FIXME
}

### 4.  Run the training with `deepspeed` command

The following command should result in 4 GPU training and we should see the training progress. Once again, the goal of this exercise is not to train this model to convergence. Once you see training taking place, you can interrupt the execution and move to the next step.

In [None]:
!deepspeed minGPT/minGPT/runFirstDeepSpeed.py --deepspeed --deepspeed_config minGPT/minGPT/ds_config_basic.json

## Step 3: Multi node execution

The above code executed on the 4 GPUs of this particular node, but our goal is to make it work across the two nodes we used earlier in the class. Please reuse the code we worked with earlier to launch a `2` node job executing the above. Let us start by creating the appropriate shell script:

In [None]:
%%writefile ./minGPT/minGPT/runSlurmStep3.sh
#!/bin/bash
#SBATCH --job-name=dli_assessment_step3
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1       
#SBATCH --cpus-per-task=32 ### Number of threads per task (OMP threads)
#SBATCH -o /dli/megatron/logs/%j.out
#SBATCH -e /dli/megatron/logs/%j.err

# Number of nodes
NUM_NODES=#FIXME
# Number of GPUs per node
NUM_GPUS=#FIXME

deepspeed --num_nodes=${NUM_NODES} --hostfile /dli/minGPT/minGPT/hostfile --num_gpus=${NUM_GPUS} /dli/minGPT/minGPT/runFirstDeepSpeed.py \
    --deepspeed \
    --deepspeed_config #FIXME

Please modify the script above to enable multi-node execution, then use the command below to submit your multi-node job (this is the command that will be used for assessment, so do not change the file names or paths).
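
The `--hostfile` passed to the launcher lists the participating hosts and how many GPU slots each one offers. The sketch below assumes the usual `hostname slots=N` layout, one host per line, and uses made-up host names and slot counts. The real values are in `/dli/minGPT/minGPT/hostfile`. `parse_hostfile` is a hypothetical helper for inspecting such a file.

```python
def parse_hostfile(text):
    """Parse 'hostname slots=N' lines into a {hostname: slots} dict."""
    hosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, slots = line.split()
        key, value = slots.split("=")
        if key != "slots":
            raise ValueError(f"unexpected entry: {line!r}")
        hosts[name] = int(value)
    return hosts


# Hypothetical host names and slot counts, for illustration only.
example = "node-a slots=8\nnode-b slots=8\n"
hosts = parse_hostfile(example)
print(len(hosts), "nodes,", sum(hosts.values()), "GPU slots in total")
```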

In [None]:
!sbatch ./minGPT/minGPT/runSlurmStep3.sh
!squeue

Once the job above is running, you can view its output and error logs with the commands below. Make sure to copy the job ID into each command. Also make sure the job writes its logs to the location shown below, with the same file name structure, as those files will be inspected for the assessment.
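
If you prefer not to copy the job ID by hand, it can be read from the line that `sbatch` prints on submission. The sketch below assumes the standard `Submitted batch job <id>` message and the log directory used in the sbatch script. `job_log_paths` is a hypothetical helper and `42` is a made-up job ID.

```python
import re

LOG_DIR = "/dli/megatron/logs"  # matches the -o / -e paths in the sbatch script


def job_log_paths(sbatch_output):
    """Extract the job ID from sbatch's output and build the log file paths."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    job_id = match.group(1)
    return job_id, f"{LOG_DIR}/{job_id}.out", f"{LOG_DIR}/{job_id}.err"


# 42 is a made-up job ID used only for illustration.
print(job_log_paths("Submitted batch job 42"))
```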

In [None]:
!JOB_ID=TODO_ENTER_JOB_ID;cat /dli/megatron/logs/$JOB_ID.out

In [None]:
!JOB_ID=TODO_ENTER_JOB_ID;cat /dli/megatron/logs/$JOB_ID.err

Once you are happy with your code, please make sure the batch job is terminated before going to the next step.

In [None]:
!squeue

In [None]:
!scancel  #PASTE_JOB_ID_HERE

## Step 4: Further code improvement

Our code does not yet support activation checkpointing. In this step, we will introduce the changes needed to enable activation checkpointing with the DeepSpeed library.

&nbsp; &nbsp; 1. Define the transformer blocks for activation checkpointing   
&nbsp; &nbsp; 2. Create the DeepSpeed configuration file enabling activation checkpointing and FP16 training   
&nbsp; &nbsp; 3. Create and run the sbatch training file  

### 1. Define the transformer blocks for activation checkpointing

To enable activation checkpointing of a model (or part of the model) with DeepSpeed, at the forward pass definition, we need to wrap each block with the function `deepspeed.checkpointing.checkpoint()` ([learn more](https://deepspeed.readthedocs.io/en/stable/activation-checkpointing.html#deepspeed.checkpointing.checkpoint)). 

The example below shows a simple convolutional network with 2 CNN blocks followed by a linear layer, in which the CNN blocks are wrapped for activation checkpointing with DeepSpeed.

```
import deepspeed
import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Two convolutional blocks; each max-pool halves the spatial resolution.
        self.cnn_block_1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(kernel_size=2))
        # Input channels must match the 32 output channels of the first block.
        self.cnn_block_2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(kernel_size=2))
        self.flatten = lambda inp: torch.flatten(inp, 1)
        # 64 * 8 * 8 assumes 32x32 input images.
        self.linearize = nn.Sequential(nn.Linear(64 * 8 * 8, 512), nn.ReLU())
        self.out = nn.Linear(512, 10)

    def forward(self, X):
        # Activations inside each wrapped block are recomputed during the backward pass.
        X = deepspeed.checkpointing.checkpoint(self.cnn_block_1, X)
        X = deepspeed.checkpointing.checkpoint(self.cnn_block_2, X)
        X = self.flatten(X)
        X = self.linearize(X)
        X = self.out(X)
        return X
```
A similar mechanism is implemented with torch via the function `torch.utils.checkpoint.checkpoint()`.
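
To make the memory/compute trade-off concrete, here is a toy sketch in plain Python (it uses neither DeepSpeed nor PyTorch). A chain of scalar "layers" is differentiated twice: once storing every layer input, and once storing only the input of each segment and recomputing the rest during the backward pass. The layer list and function names are hypothetical and exist only for this illustration.

```python
import math

# Each toy "layer" is a (forward, derivative) pair acting on a scalar.
LAYERS = [
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: 2.0 * x, lambda x: 2.0),
    (math.sin, math.cos),
    (lambda x: x * x, lambda x: 2.0 * x),
]


def grad_full(x, layers):
    """Backpropagate while storing the input of every layer."""
    inputs = []
    for f, _ in layers:
        inputs.append(x)
        x = f(x)
    grad = 1.0
    for (_, df), inp in zip(reversed(layers), reversed(inputs)):
        grad *= df(inp)
    return grad, len(inputs)


def grad_checkpointed(x, layers, segment):
    """Store only each segment's input; recompute the rest in the backward pass."""
    starts = list(range(0, len(layers), segment))
    saved = []
    for start in starts:
        saved.append(x)  # checkpoint: the only activation kept for this segment
        for f, _ in layers[start:start + segment]:
            x = f(x)
    grad = 1.0
    for start, inp in zip(reversed(starts), reversed(saved)):
        seg = layers[start:start + segment]
        inputs = []
        for f, _ in seg:  # recompute the activations inside this segment
            inputs.append(inp)
            inp = f(inp)
        for (_, df), a in zip(reversed(seg), reversed(inputs)):
            grad *= df(a)
    return grad, len(saved)


g_full, stored_full = grad_full(0.3, LAYERS)
g_ckpt, stored_ckpt = grad_checkpointed(0.3, LAYERS, segment=2)
print("gradients match:", abs(g_full - g_ckpt) < 1e-12)
print("stored activations:", stored_full, "vs", stored_ckpt)
```

Both versions produce the same gradient; the checkpointed one keeps fewer activations at the cost of running each segment's forward computation a second time.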


In our case, the VisionTransformer model is implemented as the GPT class in the file `./minGPT/minGPT/mingpt/model.py`. 

**TODO:** Complete the "Step 4 ToDo" task in the [model.py](./minGPT/minGPT/mingpt/model.py) file so that each transformer block is wrapped by DeepSpeed activation checkpointing. Replace `x = self.blocks(x)` with:
```
for block in self.blocks:
    x = deepspeed.checkpointing.checkpoint(block, x)
```


### 2. Create the DeepSpeed configuration file

Before starting, you can check the DeepSpeed documentation of the config-json file for the [activation-checkpointing.](https://www.deepspeed.ai/docs/config-json/#activation-checkpointing)

Create the `ds_config_step4.json` file by replacing each `#FIXME` in the cell below to:
- Enable activation checkpointing
- Set the micro-batch size per GPU to 128 to make sure activation checkpointing is working well
- Set the number of checkpoints to 12
- Enable FP16 training



In [None]:
%%writefile minGPT/minGPT/ds_config_step4.json
{
  "train_micro_batch_size_per_gpu": #FIXME,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-4
    }
  },
  "gradient_clipping": 1.0,
  "activation_checkpointing": {
    "partition_activations": #FIXME,
    "cpu_checkpointing": #FIXME,
    "contiguous_memory_optimization": #FIXME,
    "number_checkpoints": 12,
    "synchronize_checkpoint_boundary": #FIXME,
    "profile": #FIXME
    },
  "fp16": {
    "enabled": true
  }
}

### 3. Run the sbatch training file


Let's start by creating a copy of the training script `runFirstDeepSpeed.py`.

In [None]:
!cp /dli/minGPT/minGPT/runFirstDeepSpeed.py /dli/minGPT/minGPT/runStep4.py

Let's now create the sbatch file `runSlurmStep4.sh`.

In [None]:
%%writefile ./minGPT/minGPT/runSlurmStep4.sh
#!/bin/bash
#SBATCH --job-name=dli_assessment_step4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1       
#SBATCH --cpus-per-task=32 ### Number of threads per task (OMP threads)
#SBATCH -o /dli/megatron/logs/%j.out
#SBATCH -e /dli/megatron/logs/%j.err

# Number of nodes
NUM_NODES=2
# Number of GPUs per node
NUM_GPUS=2

deepspeed --num_nodes=${NUM_NODES} --hostfile /dli/minGPT/minGPT/hostfile --num_gpus=${NUM_GPUS} /dli/minGPT/minGPT/runStep4.py \
    --deepspeed \
    --deepspeed_config /dli/minGPT/minGPT/ds_config_step4.json

Once you have done the above, please run the command below to submit the training job to the Slurm scheduler.

In [None]:
!sbatch /dli/minGPT/minGPT/runSlurmStep4.sh
!squeue

Verify the execution of your code using the below (you should see it progress despite the large batch size):

In [None]:
!JOB_ID=TODO_ENTER_JOB_ID;cat /dli/megatron/logs/$JOB_ID.out

In [None]:
!JOB_ID=TODO_ENTER_JOB_ID;cat /dli/megatron/logs/$JOB_ID.err

Don't forget to cancel execution of your batch job once you are happy.

In [None]:
!squeue

In [None]:
!scancel  #PASTE_JOB_ID_HERE

### Further optimization consideration 
All workers participating in the training process compute the same k-means result, so the clustering is computed redundantly, once per worker. 
It is possible to adjust the k-means implementation so that it runs only once and the result is redistributed across all of the workers. 
Below is an example of how to do it:

```
import torch
import torch.distributed as dist

# kmeans, args, px and ncluster come from the training script
def run_kmeans(x, ncluster, rank, size, niter=8):
    print('KMeans executed on rank ', rank, ' World size ', size)
    N, D = x.size()
    # Placeholder centroids of the right shape, so every rank has a buffer to receive into
    c = x[torch.randperm(N)[:ncluster]]
    if rank == 0:
        # Compute k-means only on rank 0
        with torch.no_grad():
            c = kmeans(x, ncluster, niter)
    c = c.cuda(args.local_rank)  # move the tensor to the GPU for the exchange
    # Wait until rank 0 has computed the clusters, then share them with every rank
    dist.barrier()
    print('Broadcasting')
    dist.broadcast(c, src=0)
    c = c.cpu()
    print('Rank ', rank, ' has data ', c.size())
    return c

C = run_kmeans(px, ncluster, dist.get_rank(), dist.get_world_size(), niter=8)
```


## Step 5: Scaling up

Now that we have a minimal functional implementation, let's scale up the training job. In this part of the assessment, we will make the model substantially bigger. 


&nbsp; &nbsp; 1. Scale the model's architecture   
&nbsp; &nbsp; 2. Create the DeepSpeed configuration file enabling activation checkpointing, FP16 training, ZeRO optimizer     
&nbsp; &nbsp; 3. Create and run the sbatch training file  

### 1. Scale the model's architecture
Before modifying the training script, let's start by making a copy to be modified:

In [None]:
!cp /dli/minGPT/minGPT/runFirstDeepSpeed.py /dli/minGPT/minGPT/runStep5.py

**TODO**: Increase the number of layers of the Vision Transformer to **24** by editing the `GPTConfig` section of [runStep5.py](./minGPT/minGPT/runStep5.py), where the architecture of the network is currently defined as: 
```
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size,
                  embd_pdrop=0.0, resid_pdrop=0.0, attn_pdrop=0.0,
                  n_layer=12, n_head=8, n_embd=256)
```
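
For a rough sense of what doubling the depth means, the transformer blocks can be sized with a common back-of-the-envelope estimate: about `4 * n_embd**2` weights for attention (query, key, value and output projections) plus `8 * n_embd**2` for an MLP with a 4x hidden layer. The sketch below applies that approximation. It ignores biases, LayerNorm and embeddings, and `approx_block_params` is a hypothetical helper, not part of minGPT.

```python
def approx_block_params(n_layer, n_embd):
    """Approximate weight count of the transformer blocks.

    Uses 4*d^2 (attention) + 8*d^2 (MLP) per block; biases, LayerNorm
    and embeddings are ignored.
    """
    return 12 * n_layer * n_embd ** 2


for n_layer in (12, 24):
    print(f"{n_layer} layers -> {approx_block_params(n_layer, 256):,} weights")
```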


### 2. Create the DeepSpeed configuration file enabling activation checkpointing, FP16 training, ZeRO optimizer

Alter [ds_config_step5.json](./minGPT/minGPT/ds_config_step5.json) by enabling:
- Gradient accumulation with 4 accumulation steps to increase the global batch size (which is frequently needed to maintain fixed hyperparameters).
- Activation checkpointing with 24 rather than 12 checkpoints
- FP16 training
- ZeRO Stage 3 optimizer with CPU offload for both parameters and optimizer states. Check the [ZeRO documentation](https://deepspeed.readthedocs.io/en/latest/zero3.html) for more details. 

**Hint:** Notebook 6 in Lab 1 has information on the ZeRO Stage 3 optimizer.
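
Gradient accumulation increases the global batch size because DeepSpeed's effective batch size is the product of the micro-batch size per GPU, the number of accumulation steps and the number of GPUs. The sketch below works through that arithmetic for this step's values (micro-batch 128, and 2 nodes with 2 GPUs each as set in the Step 5 sbatch script). `global_batch_size` is a hypothetical helper.

```python
def global_batch_size(micro_batch_per_gpu, grad_accum_steps, num_nodes, gpus_per_node):
    """Effective batch = micro-batch per GPU x accumulation steps x number of GPUs."""
    world_size = num_nodes * gpus_per_node
    return micro_batch_per_gpu * grad_accum_steps * world_size


print(global_batch_size(128, 1, 2, 2))  # no accumulation: 512
print(global_batch_size(128, 4, 2, 2))  # 4 accumulation steps: 2048
```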


In [None]:
%%writefile minGPT/minGPT/ds_config_step5.json
{
  "train_micro_batch_size_per_gpu": 128,
  "gradient_accumulation_steps": #FIXME,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-4
    }
  },
  "gradient_clipping": 1.0,
  "activation_checkpointing": {
    "partition_activations": #FIXME,
    "cpu_checkpointing": #FIXME,
    "contiguous_memory_optimization": #FIXME,
    "number_checkpoints": #FIXME,
    "synchronize_checkpoint_boundary": #FIXME,
    "profile": #FIXME
    },
   "fp16": {
    "enabled": #FIXME
    },
    "zero_optimization": {
    "stage": 3,
    "stage3_max_live_parameters": #FIXME,
    "stage3_max_reuse_distance": #FIXME,
    "stage3_prefetch_bucket_size": #FIXME,
    "stage3_param_persistence_threshold": #FIXME,
    "reduce_bucket_size": #FIXME,
    "contiguous_gradients": #FIXME,
    "offload_optimizer": {
        "device": "cpu"
    },
    "offload_param": {
        "device": "cpu"
    }
  }
}

### 3. Create and run the sbatch training file 
Execute the next cell to generate the sbatch script for the step5 training. 


In [None]:
%%writefile ./minGPT/minGPT/runSlurmStep5.sh
#!/bin/bash
#SBATCH --job-name=dli_assessment_step5
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1       
#SBATCH --cpus-per-task=32 ### Number of threads per task (OMP threads)
#SBATCH -o /dli/megatron/logs/%j.out
#SBATCH -e /dli/megatron/logs/%j.err

# Number of nodes
NUM_NODES=2
# Number of GPUs per node
NUM_GPUS=2

deepspeed --num_nodes=${NUM_NODES} --hostfile /dli/minGPT/minGPT/hostfile --num_gpus=${NUM_GPUS} /dli/minGPT/minGPT/runStep5.py \
    --deepspeed \
    --deepspeed_config /dli/minGPT/minGPT/ds_config_step5.json

Once you have made the above changes please execute your job with the below command:

In [None]:
!sbatch /dli/minGPT/minGPT/runSlurmStep5.sh
!squeue

Verify the execution of your code using the below:

In [None]:
!JOB_ID=TODO_ENTER_JOB_ID;cat /dli/megatron/logs/$JOB_ID.out

In [None]:
!JOB_ID=TODO_ENTER_JOB_ID;cat /dli/megatron/logs/$JOB_ID.err

It is really important that you stop all running and pending jobs before going to the next step, or the evaluation will fail!

In [None]:
!squeue

In [None]:
!scancel  #PASTE_JOB_ID_HERE

## Step 6: Evaluate

If you have implemented all of the changes listed above, please provide the job ID verified in Step 5 in the code block below. If the challenges were completed correctly, an "Assessment Passed!" message will appear. Good luck!

In [None]:
from run_assessment import run_assessment
job_id = #PASTE_JOB_ID_HERE
run_assessment(job_id)

Once "Assessment Passed!" appears, please go back to the DLI portal and press the assess button. This will generate a certificate. Congratulations!