<img src="./images/DLI_Header.png" style="width: 400px;">

# 2.0 Multi-GPU Training Strategies

In this notebook, we introduce the basics of distributed training strategies and experiment with [NeMo Framework](https://github.com/NVIDIA/NeMo/), NVIDIA's conversational AI toolkit built for researchers working on automatic speech recognition (ASR), text-to-speech synthesis (TTS), large language models (LLMs), and natural language processing (NLP).

### Learning Objectives

The goals of this notebook are to:
* Understand the mechanisms behind distributed training strategies
* Run a simple distributed training job using NeMo Framework scripts on 1 node with data parallelism
* Understand the basic outputs of NeMo Framework logs

**[2.1 Introduction to Distributed Training Strategies](#2.1-Introduction-to-Distributed-Training-Strategies)<br>**
**[2.2 Single GPU Training Execution of NeMo GPT Pretraining](#2.2-Single-GPU-Training-Execution-of-NeMo-GPT-Pretraining)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[2.2.1 Check The GPT pretraining Script](#221-check-the-gpt-pretraining-script)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.2.2 Run the GPT pretraining Script](#222-run-the-gpt-pretraining-script)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.2.3 Understanding NeMo Framework Execution Logs](#223-understanding-nemo-framework-execution-logs)<br>
**[2.3 Multi-GPU Training Execution of NeMo GPT Pretraining](#2.3-Multi-GPU-Training-Execution-of-NeMo-GPT-Pretraining)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[2.3.1 Exercise: NeMo GPT pretraining execution on 2 GPUs](#2.3.1-Exercise:-NeMo-GPT-pretraining-execution-on-2-GPUs)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.3.2 Understanding Multi-GPU NeMo Framework Execution Logs](#2.3.2-Understanding-Multi-GPU-NeMo-Execution-Logs)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.3.3 Model Distribution Considerations](#233-model-distribution-considerations)<br>

### Cancel Previous Running/Pending Jobs

Before moving on, check that no jobs are still running or waiting on the SLURM queue. Let's check the SLURM jobs queue by executing the following cell:

In [1]:
# Check the SLURM jobs queue 
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


If there are still jobs running or pending, execute the following cell to cancel all the user's jobs using the `scancel` command. 

In [2]:
# Cancel all jobs belonging to the current user
!scancel -u $USER

# Check the SLURM jobs queue again (it should be empty, or the ST (state) column should show CG, i.e. completing)
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


---
# 2.1 Introduction to Distributed Training Strategies
In distributed training mode, the goal is to split the training process across multiple machines. The most commonly used distribution strategies are **data** and **model** parallelism.


## Data Distribution 

Neural networks are usually trained using [Stochastic Gradient Descent algorithms](https://developer.nvidia.com/blog/a-data-scientists-guide-to-gradient-descent-and-backpropagation-algorithms/), which split the dataset into batches that are processed sequentially. In the forward pass, the feature maps are computed; in the backward pass, gradients are computed and averaged to determine the parameter updates. Once the model's parameters are updated, the next batch of data is processed.

<img src="images/data_parallel.png" width="600"/>

In data parallelism mode, the data is split across multiple machines, and each machine processes its share with its own copy of the same neural network. The gradients computed by all machines are averaged, and the resulting parameter update is applied to every copy so that the copies stay identical.

With more processors (that is, a higher degree of data parallelism), each epoch (one pass over the entire dataset) takes less time, which speeds up training. In addition, because each update is computed from a larger number of samples (the global batch size increases), the gradient estimates are less noisy, which can reduce the time to convergence. The time taken per batch on each machine stays the same, plus the added cost of gradient exchange communication.
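The following is a minimal sketch of one data-parallel step, not NeMo code. It assumes a toy one-parameter model `y = w * x` with a mean-squared-error loss and equally sized shards; all names are illustrative. It shows that averaging the per-worker gradients gives the same gradient as processing the whole global batch on one worker.

```python
# Toy data-parallel step for a 1-parameter linear model y = w * x with
# mean-squared-error loss. All names here are illustrative assumptions.

def mse_gradient(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Global batch of 8 (x, y) samples drawn from y = 3x.
global_batch = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.5

# Reference: one worker processes the whole global batch.
full_grad = mse_gradient(w, global_batch)

# Data parallel: 2 workers, each holding a replica of w and half of the batch.
num_workers = 2
shard_size = len(global_batch) // num_workers
shards = [global_batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
local_grads = [mse_gradient(w, shard) for shard in shards]

# Gradient exchange: average the local gradients across workers.
avg_grad = sum(local_grads) / num_workers

# Every replica applies the same update, so the copies stay identical.
lr = 0.01
replicas = [w - lr * avg_grad for _ in range(num_workers)]

print(full_grad, avg_grad)
```

The equality holds here because the shards have equal size; with unequal shards the average would need to be weighted by shard size.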

There are different strategies for implementing the exchange of gradients: 
- **Centralized**: a server machine is responsible for distributing data chunks, accumulating gradients, and updating model parameters.
- **Decentralized**: each worker exchanges gradients with the other workers, then aggregates them and updates the model's parameters locally. Because workers can run at different speeds, parameters can be updated synchronously, with every worker waiting at a synchronization point, or with a relaxed strategy that lets workers operate on outdated parameters. The relaxed strategy may introduce inconsistency during training.

Several libraries offer data parallelism implementations. [Horovod](https://github.com/horovod/horovod) is compatible with several deep learning frameworks, such as TensorFlow, Keras, PyTorch, and Apache MXNet. [NVIDIA APEX](https://nvidia.github.io/apex/) is a PyTorch extension library that offers utilities to streamline distributed training and mixed precision.




## Model Distribution Strategies

Model parallelism is the process of splitting a model's parameters across multiple machines. This allows training bigger models that do not fit into 1 GPU, at the cost of additional communication to exchange feature maps.

We can distinguish 2 types of model distribution:

### Pipeline Parallelism


<img src="images/pipeline_parallel1.png" width="600"/>

Pipeline parallelism is the process of cutting a model sequentially into pieces and assigning each piece to a specific worker. For instance, layers 1,2 go on device_1, layers 3,4 on device_2, and so on.

There are different pipelining strategies, such as the micro-batch pipeline parallelism of [GPipe](https://arxiv.org/pdf/1811.06965.pdf), an optimized implementation of model pipelining that minimizes the time machines wait for their peers to communicate their outputs. It consists of partitioning each data chunk into micro-batches, so that different machines can process different micro-batches simultaneously.
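To see why micro-batches reduce waiting time, here is a simplified sketch of a forward-only pipeline schedule. It is an illustration under stated assumptions, not the GPipe implementation: every stage takes one time step per micro-batch, the backward pass is ignored, and the function names are hypothetical.

```python
def gpipe_forward_schedule(num_stages, num_micro_batches):
    """Return, for each time step, the micro-batch each stage works on (None if idle)."""
    total_steps = num_stages + num_micro_batches - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for stage in range(num_stages):
            mb = t - stage  # stage s can start micro-batch m at time step m + s
            row.append(mb if 0 <= mb < num_micro_batches else None)
        schedule.append(row)
    return schedule


def idle_fraction(num_stages, num_micro_batches):
    """Fraction of (stage, time step) slots left idle in the forward pass."""
    schedule = gpipe_forward_schedule(num_stages, num_micro_batches)
    slots = [cell for row in schedule for cell in row]
    return slots.count(None) / len(slots)


for m in (1, 4, 16):
    print(f"stages=4 micro_batches={m} idle={idle_fraction(4, m):.2f}")
```

In this simplified model, the idle fraction with `p` stages and `m` micro-batches is `(p - 1) / (m + p - 1)`, so using more micro-batches per batch keeps the devices busier.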

![title](images/pipeline_parallel.png)

Instead of assigning a single sequential set of layers to each device, interleaved pipeline parallelism assigns multiple pipeline stages to each device, each with less computation.
For instance, layers 1,2 and 9,10 go on device_1, layers 3,4 and 11,12 on device_2, and so on.
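The layer-to-device mapping described above can be sketched as follows. This is an illustrative helper, not a NeMo API: the function name and the choice of 16 layers on 4 devices are assumptions made so that the output matches the example in the text.

```python
def assign_layers(num_layers, num_devices, chunks_per_device=1):
    """Map each device (1-based) to the list of layers (1-based) it hosts."""
    num_chunks = num_devices * chunks_per_device
    if num_layers % num_chunks != 0:
        raise ValueError("num_layers must be divisible by num_devices * chunks_per_device")
    layers_per_chunk = num_layers // num_chunks
    assignment = {device: [] for device in range(1, num_devices + 1)}
    for chunk in range(num_chunks):
        device = chunk % num_devices + 1  # chunks are dealt out round-robin
        first = chunk * layers_per_chunk + 1
        assignment[device].extend(range(first, first + layers_per_chunk))
    return assignment


# Plain pipeline parallelism: one contiguous chunk per device.
print(assign_layers(num_layers=8, num_devices=4))
# Interleaved pipeline parallelism: two smaller chunks per device.
print(assign_layers(num_layers=16, num_devices=4, chunks_per_device=2))
```

With `chunks_per_device=1` the helper reproduces the plain pipeline split (layers 1,2 on device 1, layers 3,4 on device 2), and with `chunks_per_device=2` it reproduces the interleaved split.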


### Tensor Parallelism

<img src="images/tensor_parallel1.png" width="500"/>


Tensor Parallelism is the process of dividing matrix operations across workers. 
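As a small illustration, the sketch below splits the weight matrix of one linear layer column-wise across 2 workers. The column-wise split is only one possible way to divide a matrix operation and is an assumption of this example, as are all the names; plain Python lists stand in for GPU tensors.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(matrix, parts):
    """Split a matrix into `parts` blocks of contiguous columns."""
    width = len(matrix[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in matrix] for p in range(parts)]


x = [[1.0, 2.0], [3.0, 4.0]]                      # activations: 2 samples, 2 features
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # weights: 2 x 4

# Each of the 2 workers holds half of the columns of W and the full input X.
w_shards = split_columns(w, parts=2)
partial_outputs = [matmul(x, shard) for shard in w_shards]

# Gather step: concatenate the partial outputs column-wise.
y_parallel = [sum((part[i] for part in partial_outputs), []) for i in range(len(x))]

# Reference: the same product computed on a single worker.
y_reference = matmul(x, w)
print(y_parallel == y_reference)
```

Each worker stores only part of the weights, and the gather step is the extra communication that tensor parallelism introduces.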

[NeMo Framework](https://github.com/NVIDIA/NeMo/) is NVIDIA's open-source library for efficient training of transformer-based networks and multi-node pre-training of transformer-based models such as GPT, BERT, and T5 using mixed precision with a focus on conversational AI.

<img src="images/tensor_parallel.png" width="250"/>

NeMo Framework is built on top of PyTorch and integrates data, pipeline, and tensor parallelism for pre-training GPT and BERT transformer architectures using mixed precision. In this lab, we will use the distribution strategies implemented in NeMo Framework. In particular, we will focus on single-node execution on 1 or 2 devices; in the second case, we will use data, tensor, and pipeline parallelism. Several examples can be found in the [NeMo Framework repository](https://github.com/NVIDIA/NeMo/tree/main/examples).


---
# 2.2 Single GPU Training Execution of NeMo GPT Pretraining 

Let's first get familiar with a simple NeMo GPT execution script.

For distributed training mode, it's sufficient to use Python. There's no need to use the PyTorch distributed launcher ([torchrun](https://pytorch.org/docs/stable/elastic/run.html) or [torch.distributed](https://pytorch.org/tutorials/beginner/dist_overview.html)) even for multi-node jobs, as NeMo Framework handles multi-node communication automatically.

The resources are configured with the arguments `trainer.num_nodes` and `trainer.devices`, which specify, respectively, the number of nodes and the number of GPUs per node to use.

In this section, we will run a simple NeMo GPT pretraining execution on 1 GPU by running the NeMo [megatron_gpt_pretraining.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_pretraining.py) script. We will use an [example training config](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml) for GPT models as a foundation. We will override some arguments using [Hydra](https://hydra.cc/docs/intro/), which allows on-the-fly config edits from the command line. You can find some of the config parameters in the following image:

<img src="images/nemo_run.png" width="600"/>

Distributed strategies are configured with the arguments `model.tensor_model_parallel_size` and `model.pipeline_model_parallel_size`.
Learn more about these arguments in the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/parallelisms.html).
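As a rough illustration of how these settings reach the training script, the sketch below assembles the four arguments named above into `key=value` overrides. The helper `build_overrides` is hypothetical and not part of NeMo or Hydra; only the four argument names come from this notebook.

```python
def build_overrides(num_nodes, gpus_per_node, tp_size=1, pp_size=1):
    """Return key=value overrides for the resource and parallelism settings."""
    settings = {
        "trainer.num_nodes": num_nodes,
        "trainer.devices": gpus_per_node,
        "model.tensor_model_parallel_size": tp_size,
        "model.pipeline_model_parallel_size": pp_size,
    }
    return [f"{key}={value}" for key, value in settings.items()]


# 1 node, 1 GPU, no model parallelism: the configuration used in this section.
overrides = build_overrides(num_nodes=1, gpus_per_node=1)
print(" ".join(overrides))
```

The resulting strings are appended after the script name and the `--config-path` and `--config-name` options on the command line, which is what the prepared shell scripts do with their `TRAINER_ARGS` and `PARALLEL_ARGS` variables.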

We have prepared the script [pretrain_gpt_1GPU.sh](./code/pretrain_gpt_1GPU.sh) that will run GPT pretraining on only 1 GPU (with no distribution strategy applied).

This script assumes that the compute resources are already allocated. Thus, for the execution, we will need to first allocate the required GPU by connecting to a worker node in an interactive session.

## 2.2.1 Check the GPT Pretraining Script

Let's have a look at the script before allocating the resources and executing it. 

Notice the model architecture and training arguments. 

In [3]:
# Have a look at the NeMo GPT pretraining execution on 1 GPU script
!cat /dli/code/pretrain_gpt_1GPU.sh

#!/bin/bash

# Distributed training args
NNODES=1
GPUS_PER_NODE=1
TP_SIZE=1
PP_SIZE=1

# Distributed training 
MICRO_BATCH_SIZE=16
GLOBAL_BATCH_SIZE=16

# Model architecture 
NLAYERS=12
NHIDDEN=768
NHEADS=32
SEQ_LEN=1024

# Data Paths
VOCAB_FILE=/dli/data/GPT-2_assets/gpt2-vocab.json
MERGE_FILE=/dli/data/GPT-2_assets/gpt2-merges.txt
DATA_PATH=[1.0,/dli/data/GPT-2_assets/my-gpt2_text_document]

OUTPUT_PATH=/dli/nemo
LOGS_PATH=/dli/nemo/logs
NAME="1Node1GPU"


OPTIMIZER_ARGS=" \
            model.optim.name=fused_adam \
            model.optim.betas=[0.9,0.95] \
            model.optim.lr=6e-5 \
            model.optim.sched.min_lr=6e-6 \
            model.optim.sched.name=CosineAnnealing \
            +model.optim.sched.max_steps=800 \
            model.optim.sched.warmup_steps=80 \
            model.optim.weight_decay=1e-1 \
        "
        
TRAINER_ARGS=" \
            trainer.gradient_clip_val=1.0 \
            trainer.precision=32 \
            trainer.devices=$GPUS_PER_NODE \
   


## 2.2.2 Run the GPT Pretraining Script

Now, let's run the `pretrain_gpt_1GPU.sh` script in an interactive session. To do so, follow these 3 steps:
1. Launch a terminal session
2. Run an interactive session by executing `srun -N 1 --pty /bin/bash`
3. Run the NeMo GPT-3 pretraining on 1 GPU by executing `bash ./code/pretrain_gpt_1GPU.sh`


<img src="images/interactive_launch0.png" width="1050"/>

Run the following cell to generate a link to open a terminal session, along with the instructions to run an interactive session. Then, submit a GPT pretraining job on 1 GPU.

In [1]:
%%html

<pre>
   Step 1: Open a terminal session by following the <a href="" data-commandlinker-command="terminal:create-new">Terminal link</a>
   Step 2: Run an interactive session: <font color="green">srun -N 1 --pty /bin/bash</font>
   Step 3: Run the NeMo gpt3 pretraining on 1 GPU: <font color="green">bash ./code/pretrain_gpt_1GPU.sh</font>
</pre>


While the GPT pretraining on 1 GPU is running, we can check the SLURM queue by running this cell:

In [5]:
# Check the SLURM queue
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                10  slurmpar     bash     root  R       0:31      1 slurmnode1


We can also check the GPUs using the `nvidia-smi` command. We should see only GPU 0 utilized, as shown in the figure below. Note that GPU types and memory can differ depending on the configuration of your course environment.

<img src="images/1N_1gpu_utilization.png" width="650"/>

In [6]:
# Check GPU utilization on the master node after NeMo starts running
!sleep 60s
!nvidia-smi

Thu Mar 21 21:26:10 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                    0 |
| N/A   58C    P0             327W / 300W |  79613MiB / 81920MiB |     93%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000002:00:00.0 Off |  

## 2.2.3  Understanding NeMo Framework Execution Logs 

To understand the performance of the GPT pretraining, we can check the [log file](./nemo/logs/1Node1GPU.txt) generated during the execution.

In [7]:
# Check the NeMo GPT3 pretraining logs.
!cat /dli/nemo/logs/1Node1GPU.txt | tail -3

Epoch 0:  59%|██████████████████▉             | 89/150 [01:00<00:41,  1.47it/s, loss=9.58, v_num=, reduced_train_loss=9.300, global_step=59.00, consumed_samples=944.0, val_loss=9.820]
Epoch 0:  60%|███████████████████▏            | 90/150 [01:00<00:40,  1.48it/s, loss=9.58, v_num=, reduced_train_loss=9.300, global_step=59.00, consumed_samples=944.0, val_loss=9.260]
Epoch 0, global step 60: 'val_loss' reached 9.26304 (best 9.26304), saving model to '/dli/nemo/1Node1GPU/checkpoints/megatron_gpt--val_loss=9.26-step=60-consumed_samples=944.0.ckpt' as top 10


From the extract, the outputs should be similar to:

```
Epoch 0: 100%|█| 150/150 [01:44<00:00,  1.43it/s, loss=8.52, v_num=, reduced_train_loss=8.56
```   

In this example, notice the training speed of 1.43 it/s, which is equal to 22.88 samples/s with a batch size of 16.
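The conversion from iterations per second to samples per second can be sketched as below. The helper name and the regular expression are assumptions made for illustration; the log line and the batch size of 16 are taken from the example above.

```python
import re


def iterations_per_second(progress_line):
    """Extract the 'X.XXit/s' figure from a progress-bar line."""
    match = re.search(r"([\d.]+)it/s", progress_line)
    if match is None:
        raise ValueError("no it/s figure found in the line")
    return float(match.group(1))


line = "Epoch 0: 100%|█| 150/150 [01:44<00:00,  1.43it/s, loss=8.52, v_num=, reduced_train_loss=8.56"
global_batch_size = 16  # one iteration consumes one global batch

samples_per_second = iterations_per_second(line) * global_batch_size
print(f"{samples_per_second:.2f} samples/s")
```

Your own run will report slightly different speeds, so apply the same calculation to the last progress line of your log.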


Great! Before moving on, let's release some disk space by deleting the unnecessary checkpoints generated by the previous execution, and cancel the remaining interactive session.

In [8]:
# Delete the checkpoints and cancel the interactive session
!rm -rf /dli/nemo/1Node1GPU/checkpoints
!scancel -u $USER

----

# 2.3 Multi-GPU Training Execution of NeMo GPT Pretraining

Let's now execute the same training job using the 2 GPUs available in the interactive session.

For that, we need to set the number of devices per node to `trainer.devices=2`. 

The first distribution strategy we will experiment with is the data parallel distribution strategy, which is executed by default with NeMo when several resources are available.
             
In the previous execution on one single GPU, the batch size processed by the GPU was 16 (set by `model.micro_batch_size`) which also corresponds to the global batch size (set by `model.global_batch_size`). 
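The relationship between the two batch sizes can be sketched as follows, assuming the usual convention that the global batch size equals the micro batch size times the number of data-parallel replicas times the number of micro-batches each replica accumulates per optimizer step. The helper name is hypothetical and not a NeMo function.

```python
def gradient_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Micro-batches each model replica processes per optimizer step."""
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * data_parallel_size"
        )
    return global_batch_size // samples_per_pass


# 1 GPU, as in the previous run: one micro-batch of 16 per optimizer step.
print(gradient_accumulation_steps(global_batch_size=16, micro_batch_size=16, data_parallel_size=1))

# 2 data-parallel GPUs with the micro batch kept at 16: a global batch of 32
# again gives one micro-batch per GPU per optimizer step.
print(gradient_accumulation_steps(global_batch_size=32, micro_batch_size=16, data_parallel_size=2))
```

Under this assumption, keeping the global batch size at 16 on 2 data-parallel GPUs with a micro batch size of 16 would not be a valid combination, which is why the global batch size grows with the number of GPUs.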


## 2.3.1 Exercise: NeMo GPT pretraining execution on 2 GPUs
Let's configure the new NeMo GPT pretraining execution on 2 GPUs using data parallel distribution by editing the lines marked `# <--- CHANGE HERE` in the following cell.

To use 2 GPUs, we can keep the micro batch size per GPU set to 16 and thus double the global batch size to 32. If you get stuck, feel free to look at the [solution](./solutions/ex2.3.ipynb).

Note that we change the log file name for each run (*1Node2GPUS.txt* in the following example).

In [9]:
%%writefile /dli/code/pretrain_gpt_2GPU.sh
#!/bin/bash

# Distributed training args
NNODES=1
GPUS_PER_NODE=2         # <--- CHANGE HERE
TP_SIZE=1
PP_SIZE=1

# Distributed training 
MICRO_BATCH_SIZE=16
GLOBAL_BATCH_SIZE=32    # <--- CHANGE HERE

# Model architecture 
NLAYERS=12
NHIDDEN=768
NHEADS=32
SEQ_LEN=1024

# Data Paths
VOCAB_FILE=/dli/data/GPT-2_assets/gpt2-vocab.json
MERGE_FILE=/dli/data/GPT-2_assets/gpt2-merges.txt
DATA_PATH=[1.0,/dli/data/GPT-2_assets/my-gpt2_text_document]

OUTPUT_PATH=/dli/nemo
LOGS_PATH=/dli/nemo/logs
NAME="1Node2GPUS"      


OPTIMIZER_ARGS=" \
            model.optim.name=fused_adam \
            model.optim.betas=[0.9,0.95] \
            model.optim.lr=6e-5 \
            model.optim.sched.min_lr=6e-6 \
            model.optim.sched.name=CosineAnnealing \
            +model.optim.sched.max_steps=800 \
            model.optim.sched.warmup_steps=80 \
            model.optim.weight_decay=1e-1 \
        "
        
TRAINER_ARGS=" \
            trainer.gradient_clip_val=1.0 \
            trainer.precision=32 \
            trainer.devices=$GPUS_PER_NODE \
            trainer.num_nodes=$NNODES \
            trainer.max_steps=100 \
            trainer.enable_model_summary=true \
            trainer.log_every_n_steps=10 \
            trainer.val_check_interval=20 \
            trainer.limit_val_batches=10 \
        "

GPT_ARGS=" \
            model.num_layers=$NLAYERS \
            model.hidden_size=$NHIDDEN \
            model.num_attention_heads=$NHEADS \
            model.encoder_seq_length=$SEQ_LEN \
            model.data.seq_length=$SEQ_LEN \
            model.max_position_embeddings=$SEQ_LEN \
            model.micro_batch_size=$MICRO_BATCH_SIZE \
            model.global_batch_size=$GLOBAL_BATCH_SIZE \
            model.tokenizer.vocab_file=$VOCAB_FILE \
            model.tokenizer.merge_file=$MERGE_FILE \
            model.init_method_std=0.006 \
            $OPTIMIZER_ARGS \
        "

OUTPUT_ARGS=" \
            exp_manager.explicit_log_dir=$OUTPUT_PATH/$NAME \
            exp_manager.resume_if_exists=false \
            exp_manager.name=$NAME \
        "

PARALLEL_ARGS=" \
            model.tensor_model_parallel_size=$TP_SIZE \
            model.pipeline_model_parallel_size=$PP_SIZE \
        "


export CMD=" \
            python /dli/code/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
            --config-path=/dli/code/NeMo/examples/nlp/language_modeling/conf/ \
            --config-name=megatron_gpt_config.yaml \
            $TRAINER_ARGS \
            $PARALLEL_ARGS \
            $GPT_ARGS \
            $OUTPUT_ARGS \
            model.data.data_prefix=$DATA_PATH \
            model.data.data_impl=mmap \
            model.data.splits_string=\"949,50,1\" \
        "

bash -c '$LAUNCHER $CMD' 2>&1 | tee -a $LOGS_PATH/$NAME.txt

Overwriting /dli/code/pretrain_gpt_2GPU.sh



Now let's run this script in an interactive session. To do so, follow these 3 steps:
1. Launch a terminal session
2. Run an interactive session by executing `srun -N 1 --pty /bin/bash`
3. Run the NeMo gpt3 pretraining on 2 GPUs by executing `bash ./code/pretrain_gpt_2GPU.sh`

Run the following cell to get the link to open a terminal session and the instructions to run an interactive session. Then, submit a pretraining job on 2 GPUs.

In [10]:
%%html

<pre>
   Step 1: Open a terminal session by following the <a href="" data-commandlinker-command="terminal:create-new">Terminal link</a>
   Step 2: Run an interactive session: <font color="green">srun -N 1 --pty /bin/bash</font>
   Step 3: Run the NeMo gpt3 pretraining on 2 GPUs: <font color="green">bash ./code/pretrain_gpt_2GPU.sh</font>
</pre>

While the GPT pretraining on 1 Node and 2 GPUs is running, we can check the SLURM queue.

In [11]:
# Check the SLURM queue
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                11  slurmpar     bash     root  R       0:19      1 slurmnode1


We can also check the GPUs using the `nvidia-smi` command. We should see GPUs 0 and 1 utilized, as shown in the figure below.

<img src="images/1N_2gpus_utilization.png" width="650"/>

In [12]:
# Check GPU utilization on the master node
!sleep 60s
!nvidia-smi

Thu Mar 21 21:29:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          On  | 00000001:00:00.0 Off |                    0 |
| N/A   42C    P0              74W / 300W |  46169MiB / 81920MiB |      4%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000002:00:00.0 Off |  

## 2.3.2 Understanding Multi-GPU NeMo Execution Logs

Let's have a look at the execution logs:

<img src="images/interactive_launch2.png" width="900"/>

We can see that two processes with GLOBAL_RANK 0 and 1 were initialized. As we have 2 GPUs available, the distributed strategy executed by default is the data parallel strategy, meaning that the model is copied onto both GPUs and each copy processes different data batches.

To understand the performance of the GPT pretraining on 2 GPUs, we can check the [log file](./nemo/logs/1Node2GPUS.txt) generated during the execution.

In [13]:
!cat /dli/nemo/logs/1Node2GPUS.txt | tail -3

Epoch 0:  59%|██████████████████▍            | 89/150 [01:26<00:59,  1.03it/s, loss=9.55, v_num=, reduced_train_loss=9.280, global_step=59.00, consumed_samples=1888.0, val_loss=9.800]
Epoch 0:  60%|██████████████████▌            | 90/150 [01:26<00:57,  1.04it/s, loss=9.55, v_num=, reduced_train_loss=9.280, global_step=59.00, consumed_samples=1888.0, val_loss=9.220]
Epoch 0, global step 60: 'val_loss' reached 9.21666 (best 9.21666), saving model to '/dli/nemo/1Node2GPUS/checkpoints/megatron_gpt--val_loss=9.22-step=60-consumed_samples=1888.0.ckpt' as top 10


From the log extract, notice the training performance when using 2 GPUs compared to 1 GPU.

`Epoch 0: 100%|█| 150/150 [02:12<00:00,  1.13it/s, loss=8.48, v_num=, reduced_train_loss=8.50`
 
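Notice the number of samples consumed and the number of iterations per second. With a global batch size of 32, 1.13 it/s is equivalent to 36.16 samples/s, about 1.58 times the 22.88 samples/s measured on 1 GPU. Ideal linear scaling, which is the desirable property in multi-GPU systems, would give 2 times the single-GPU throughput; part of the gap can be attributed to overheads such as gradient exchange communication.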
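To quantify the comparison, here is a short sketch that computes the speedup and the scaling efficiency from the two example runs. The function name is an assumption made for illustration; the it/s figures and batch sizes are the ones quoted in this notebook.

```python
def scaling_report(base_samples_per_s, new_samples_per_s, base_gpus, new_gpus):
    """Return (speedup, efficiency) of the new run relative to the base run."""
    speedup = new_samples_per_s / base_samples_per_s
    efficiency = speedup / (new_gpus / base_gpus)  # 1.0 means perfectly linear scaling
    return speedup, efficiency


one_gpu = 1.43 * 16   # it/s * global batch size on 1 GPU
two_gpus = 1.13 * 32  # it/s * global batch size on 2 GPUs

speedup, efficiency = scaling_report(one_gpu, two_gpus, base_gpus=1, new_gpus=2)
print(f"speedup={speedup:.2f}x efficiency={efficiency:.0%}")
```

Plug in the figures from your own logs to see how close your environment gets to linear scaling.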

Discuss the performance with the instructor. The major change here is that more samples are processed in the same amount of time, which helps the model learn richer data representations and speeds up training.

Great! Before moving on, let's release some disk space by deleting the unnecessary checkpoints generated by the previous execution.


In [14]:
# Clean the checkpoints
!rm -rf /dli/nemo/1Node2GPUS/checkpoints

## 2.3.3 Model Distribution Considerations 

To execute the previous multi-GPU script in tensor or pipeline parallel mode, we can configure the distribution using the arguments `model.tensor_model_parallel_size` and `model.pipeline_model_parallel_size`.

The world size of a NeMo GPT training job is the total number of GPUs. It is the product of `data_parallel_size`, `tensor_model_parallel_size`, and `pipeline_model_parallel_size`. For a fixed number of GPUs, the world size stays the same, so these three sizes must be adjusted together to match your configuration.
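The product rule above can be turned into a small helper that derives the data parallel size from the other values. This is an illustrative sketch with a hypothetical function name, not a NeMo API.

```python
def data_parallel_size(world_size, tensor_model_parallel_size=1, pipeline_model_parallel_size=1):
    """Derive the data parallel size from the world size and the model parallel sizes."""
    model_parallel_size = tensor_model_parallel_size * pipeline_model_parallel_size
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by the tensor and pipeline parallel sizes")
    return world_size // model_parallel_size


# On 1 node with 2 GPUs (world size 2), the possible splits are:
for tp, pp in [(1, 1), (2, 1), (1, 2)]:
    print(f"TP={tp} PP={pp} -> DP={data_parallel_size(2, tp, pp)}")
```

With 2 GPUs, enabling tensor or pipeline parallelism of size 2 leaves a data parallel size of 1, so the global batch size would then be shared by a single model replica.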

---
<h2 style="color:green;">Congratulations!</h2>

Great job with pretraining GPT-3 on a GPU cluster.<br>

Before moving on, we need to make sure that no jobs are still running or waiting on the SLURM queue. 
Let's check the SLURM jobs queue by executing the following cell:

In [15]:
# Check the SLURM jobs queue 
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                11  slurmpar     bash     root  R       2:32      1 slurmnode1


If there are still jobs running or pending, execute the following cell to cancel all the user's jobs using the `scancel` command. 

In [16]:
# Cancel all jobs belonging to the current user
!scancel -u $USER

# Check the SLURM jobs queue again (it should be empty, or the ST (state) column should show CG, i.e. completing)
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                11  slurmpar     bash     root CG       2:35      1 slurmnode1


Next, we will run GPT language model training in multi-node distribution configurations. Move on to [03_GPT_LM_pretrainings_multinodes.ipynb](03_GPT_LM_pretrainings_multinodes.ipynb).