## System Setup

In [1]:
!nvidia-smi

Thu Oct 26 19:51:26 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB           On | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0               57W / 400W|      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB           On | 00000000:0F:00.0 Off |  

In [2]:
%env CUDA_VISIBLE_DEVICES=2,3,4,5,6,7

env: CUDA_VISIBLE_DEVICES=2,3,4,5,6,7
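
Setting `CUDA_VISIBLE_DEVICES` also renumbers the devices inside the process: the first listed GPU becomes device 0, the second device 1, and so on. The sketch below illustrates that remapping with a hypothetical helper, `visible_device_map`; it only parses the string and does not query CUDA.

```python
def visible_device_map(env_value: str) -> dict[int, int]:
    """Map in-process device index -> physical GPU index.

    Hypothetical helper for illustration; assumes a comma-separated
    list of integer GPU indices (UUID-style entries are not handled).
    """
    physical = [int(tok) for tok in env_value.split(",") if tok.strip()]
    return dict(enumerate(physical))


mapping = visible_device_map("2,3,4,5,6,7")
for logical, phys in mapping.items():
    print(f"cuda:{logical} -> physical GPU {phys}")
```

With the value used in this notebook, `cuda:0` inside the process refers to physical GPU 2.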


## Imports

In [3]:
import transformers as tr
from datasets import load_dataset
import os



## 1. Load the dataset

We will use the [Rotten Tomatoes sentiment classification dataset](https://huggingface.co/datasets/rotten_tomatoes) for this session. It has binary sentiment labels (positive or negative) and is a good choice for demonstrating how to train a model on a small dataset.

In [4]:
rotten = load_dataset("rotten_tomatoes")  # load the dataset
rotten

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [5]:
rotten['train'].to_pandas()

| | text | label |
|---|---|---|
| 0 | the rock is destined to be the 21st century's ... | 1 |
| 1 | the gorgeously elaborate continuation of " the... | 1 |
| 2 | effective but too-tepid biopic | 1 |
| 3 | if you sometimes like to go to the movies to h... | 1 |
| 4 | emerges as something rare , an issue movie tha... | 1 |
| ... | ... | ... |
| 8525 | any enjoyment will be hinge from a personal th... | 0 |
| 8526 | if legendary shlockmeister ed wood had ever ma... | 0 |
| 8527 | hardly a nuanced portrait of a young woman's b... | 0 |
| 8528 | interminably bleak , to say nothing of boring . | 0 |


## 2. Load the model and pre-process our dataset using its tokenizer

We will use the `t5-small` model, which is small enough for this demo session; larger models take longer to train and need more memory. As is usual when fine-tuning, we pre-process our dataset with the same tokenizer that was used for pre-training. `transformers` provides an `AutoTokenizer` class that automatically selects the correct tokenizer for a given checkpoint, and an `AutoModelForSeq2SeqLM` class that does the same for the model. The cell below uses the T5-specific classes directly and shows the `Auto` variants as commented-out alternatives.

In [6]:
model_checkpoint = "t5-small"

# model = tr.AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)  # Alternative way to initialize the model
model = tr.T5ForConditionalGeneration.from_pretrained(model_checkpoint)  # Initialize the model

# tokenizer = tr.AutoTokenizer.from_pretrained(model_checkpoint)  # Alternative way to initialize the tokenizer
tokenizer = tr.T5Tokenizer.from_pretrained(model_checkpoint)  # Initialize the tokenizer

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
classes = {0: "negative", 1: "positive"}  # Sentiment classes

def map_fn(data):  # Function to tokenize a batch of examples
    # Tokenize the review text as the input and the label name as the target
    return tokenizer(
        data['text'],  # Tokenize the text
        text_target=[classes[label] for label in data['label']],  # Convert 0/1 to "negative"/"positive" text labels
        truncation=True,  # Truncate the inputs if they are too long
        padding=True,  # Pad to the longest sequence in the batch
        return_tensors='np'  # Return NumPy arrays
    )

# Tokenize the dataset
rotten_tokenized = rotten.map(  # Apply map_fn to each split of the dataset
    map_fn,
    batched=True,  # Pass batches of examples to map_fn
    remove_columns=['text', 'label']  # Remove the untokenized columns
)

Map: 100%|██████████| 8530/8530 [00:01<00:00, 7348.41 examples/s]
Map: 100%|██████████| 1066/1066 [00:00<00:00, 7304.97 examples/s]
Map: 100%|██████████| 1066/1066 [00:00<00:00, 7258.78 examples/s]
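
One detail worth knowing about `text_target` combined with `padding=True`: if the target strings tokenize to different lengths, the padded positions in `labels` count towards the loss unless they are replaced with `-100`, the index that PyTorch's cross-entropy loss ignores. Here both targets are single words, so this likely does not arise, but the sketch below shows the replacement on plain lists. The helper name `mask_label_padding` and the example token IDs are made up; the pad ID of 0 matches T5's tokenizer.

```python
IGNORE_INDEX = -100  # Label value ignored by PyTorch's cross-entropy loss
PAD_TOKEN_ID = 0     # T5's pad token id


def mask_label_padding(labels, pad_id=PAD_TOKEN_ID):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_id else tok for tok in seq]
        for seq in labels
    ]


# Made-up token ids: the second sequence is padded to the length of the first
padded_labels = [[1465, 207, 1], [2841, 1, 0]]
masked = mask_label_padding(padded_labels)
print(masked)  # [[1465, 207, 1], [2841, 1, -100]]
```

`transformers` also ships `DataCollatorForSeq2Seq`, which pads labels with `-100` at batch time and may be a simpler route for targets of varying length.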


## 3. Set up the training
Hugging Face has a [Trainer](https://huggingface.co/transformers/main_classes/trainer.html) class that makes training models easy. It takes care of logging, checkpointing, and other bookkeeping tasks. A companion `TrainingArguments` class is used to configure the training job.

By default, the `Trainer` uses all visible GPUs for training, which we control through the `CUDA_VISIBLE_DEVICES` environment variable. Earlier in this notebook, we set it to `2,3,4,5,6,7` to use every GPU on the system except GPUs 0 and 1.

In [8]:
# Define params
checkpoint_dir = "checkpoints"  # Directory to save the checkpoints to
run_name = "t5-small-rotten-tomatoes"  # A name for the current training run
epochs = 3  # Number of training epochs
batch_size = 128  # Batch size
optimizer = "adamw_torch"  # Optimizer to use. Adam with weight decay is a good standard choice
# tensorboard_dir = "tensorboard"  # Directory to save TensorBoard logs to

# Create the directories
checkpoint_path = os.path.join(checkpoint_dir, run_name)
os.makedirs(checkpoint_dir, exist_ok=True)
# if not os.path.exists(tensorboard_dir):
#     os.makedirs(tensorboard_dir)

# Define the training arguments
training_args = tr.TrainingArguments(
    checkpoint_path,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    optim=optimizer,
    # fp16=True,  # Use FP16 training
    # bf16=True,  # Use BF16 training (requires Ampere or newer GPUs, such as the A100)
)
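
To get a feel for how short this run is, the sketch below works out the number of optimizer steps. It assumes all six visible GPUs are used, so the effective batch is `per_device_train_batch_size` times six, with no gradient accumulation; the helper name `training_steps` is made up.

```python
import math


def training_steps(num_examples, per_device_batch, num_gpus, epochs, grad_accum=1):
    """Return (global batch size, steps per epoch, total steps)."""
    global_batch = per_device_batch * num_gpus * grad_accum
    steps_per_epoch = math.ceil(num_examples / global_batch)
    return global_batch, steps_per_epoch, steps_per_epoch * epochs


global_batch, per_epoch, total = training_steps(
    num_examples=8530,     # Size of the train split shown earlier
    per_device_batch=128,  # batch_size from the cell above
    num_gpus=6,            # Assumes all six visible GPUs are used
    epochs=3,
)
print(global_batch, per_epoch, total)  # 768 12 36
```

Since `logging_steps` in `TrainingArguments` defaults to 500, a run of about 36 steps would finish before the first training-loss entry is logged.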

In [9]:
# Initialize the trainer
data_collator = tr.DataCollatorWithPadding(tokenizer=tokenizer)  # Initialize the data collator (it pads and stacks individual examples into a batch, like a collate_fn in a PyTorch DataLoader)
trainer = tr.Trainer(
    model,
    training_args,
    train_dataset=rotten_tokenized['train'],  # Pass in the tokenized training dataset
    eval_dataset=rotten_tokenized['validation'],  # Pass in the tokenized validation dataset
    data_collator=data_collator,  # Pass in the data collator
    tokenizer=tokenizer,  # Pass in the tokenizer
)

## 4. Start Training

In [10]:
# Initialize tensorboard
# %load_ext tensorboard
# %tensorboard --logdir tensorboard  # In VSCode, simply click on Launch TensorBoard and select the tensorboard directory

In [11]:
trainer.train()  # Train the model

trainer.save_model()  # Save the model



Step,Training Loss


In [12]:
!nvidia-smi

Thu Oct 26 19:52:17 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB           On | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0               56W / 400W|      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB           On | 00000000:0F:00.0 Off |  

## 5. Predict using the trained model

In [13]:
trained_model = tr.AutoModelForSeq2SeqLM.from_pretrained(checkpoint_path)  # Load the trained model

In [14]:
reviews = [
    """
This movie is a waste of time. The acting is terrible. The plot is ridiculous. I will never watch this movie again.
""",
"""
This movie is really good. The acting is great. The plot is interesting. I will definitely watch this movie again.
""",
"""
movie good but bad 
"""
]

# Tokenize the reviews
tokenized_reviews = tokenizer(
    reviews,
    truncation=True,
    padding=True,
    return_tensors='pt'
)

# Generate the sentiment labels
pred_labels = trained_model.generate(
    input_ids=tokenized_reviews['input_ids'].to(trained_model.device),  # Move the input IDs to the model's device
    attention_mask=tokenized_reviews['attention_mask'].to(trained_model.device),  # Move the attention mask to the model's device
)

# Decode the generated token IDs back into text
tokenizer.batch_decode(pred_labels, skip_special_tokens=True)



['Dieser Film ist eine Verschwendung von Zeit. Die Handlung ist schrecklich.',
 'This movie is really good. The acting is great. The plot is interesting.',
 'movie good but bad']
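
Because the model emits free text rather than a class index, the decoded strings still need to be mapped back to labels, and, as the output above shows, a model that has barely been trained may not produce a label at all. A minimal sketch of such a mapping, assuming exact matches against the `classes` values from earlier and `None` for anything else (the helper `to_class_id` is hypothetical):

```python
classes = {0: "negative", 1: "positive"}  # Same mapping as in the tokenization cell
label_to_id = {name: idx for idx, name in classes.items()}


def to_class_id(decoded):
    """Return the class id for a decoded string, or None if it is not a known label."""
    return label_to_id.get(decoded.strip().lower())


decoded_outputs = [
    "negative",
    "positive",
    "movie good but bad",  # Not a label: the model echoed its input
]
print([to_class_id(text) for text in decoded_outputs])  # [0, 1, None]
```

Counting the `None` results on a validation set is one quick way to see whether the model has learned the output format at all.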

## 6. DeepSpeed accelerator

We can further speed up training with the [DeepSpeed](https://www.deepspeed.ai/) library. As models grow larger, we need to make the best use of every computational resource at our disposal, and DeepSpeed provides a number of features that help with this. Its ZeRO optimizations partition the optimizer states, gradients, and parameters across GPUs, and [ZeRO-Offload](https://www.deepspeed.ai/features/#zero-offload) can additionally move the optimizer states to host memory, which makes it possible to train models that would not otherwise fit in GPU memory. The configuration below enables ZeRO Stage 3 and leaves a Stage 2 setup with optimizer offload as a commented-out alternative.

We can use DeepSpeed through Hugging Face by specifying a DeepSpeed configuration. Many options can be tuned to get the best performance for your setup; they are documented in the [DeepSpeed configuration reference](https://www.deepspeed.ai/docs/config-json/). We will use the following configuration for this session:

In [15]:
# Define the DeepSpeed zero optimization config
zero_config = {
    "train_batch_size": "auto",  # train_batch_size is the global batch size
    "train_micro_batch_size_per_gpu": "auto",  # train_micro_batch_size_per_gpu is the per-GPU batch size
    # train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs

    # ZeRO parameters
    "zero_optimization": {
        # "stage": 2,  # Enable ZeRO Stage 2
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},  # Offload the optimizer to the CPU  
        
        "stage": 3,  # Enable ZeRO Stage 3
        "stage3_gather_16bit_weights_on_model_save": True,  # Gather the partitioned 16-bit weights when saving, so a consolidated model can be written

        "contiguous_gradients": True,  # Enable contiguous gradients, which improves performance by reducing memory fragmentation
        "overlap_comm": True,
    },

    # Optimizer
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto",
            "torch_adam": True,
        },
    },

    # Scheduler
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
        }
    }
}
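
The `deepspeed` argument of `TrainingArguments` accepts either a dictionary like the one above or a path to a JSON file, and a file is convenient when training from a standalone script. A sketch of writing the config out, using a trimmed copy of the dictionary and a made-up file name, `ds_config.json`, in a temporary directory:

```python
import json
import os
import tempfile

# Trimmed copy of the config above, for illustration only
zero_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
        "contiguous_gradients": True,
        "overlap_comm": True,
    },
}

config_path = os.path.join(tempfile.mkdtemp(), "ds_config.json")  # Hypothetical location
with open(config_path, "w") as f:
    json.dump(zero_config, f, indent=2)  # Python True is written as JSON true

with open(config_path) as f:
    reloaded = json.load(f)
print(reloaded["zero_optimization"]["stage"])  # 3
```
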

Description of some of the parameters in `zero_optimization`:
- `stage`: Stages 0, 1, 2, and 3 correspond to ZeRO disabled, optimizer state partitioning, optimizer state + gradient partitioning, and optimizer state + gradient + parameter partitioning, respectively.
- `offload_optimizer`: Whether to offload the optimizer state to host memory (CPU/NVMe). This frees up GPU memory for larger models. Model parameters can also be offloaded with the `offload_param` parameter (stage 3 only).
- `overlap_comm`: Attempts to overlap the gradient reduction with the backward pass. This can help speed up training and reduce idle time.
- `contiguous_gradients`: Copies gradients into a contiguous buffer as they are produced, which avoids memory fragmentation during the backward pass.
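
To make the stages concrete, the sketch below estimates per-GPU memory for model states using the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 optimizer states. These are rough figures under that assumption; they ignore activations and buffers, and this notebook leaves `fp16`/`bf16` commented out, so its real numbers will differ. The function name `zero_model_state_gb` is made up.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB for a ZeRO stage (0-3)."""
    weights, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        optim /= num_gpus    # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus    # Stage 2: also partition gradients
    if stage >= 3:
        weights /= num_gpus  # Stage 3: also partition parameters
    return (weights + grads + optim) / 1e9


# 220 million parameters (the t5-base size quoted later in this notebook) on 6 GPUs
for stage in range(4):
    print(f"stage {stage}: {zero_model_state_gb(220e6, 6, stage):.2f} GB per GPU")
```

Under these assumptions, Stage 3 divides the whole model-state footprint by the number of GPUs, while Stages 1 and 2 only divide part of it.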

Let's now train `t5-small` again, this time using DeepSpeed.

In [16]:
model_checkpoint = "t5-small"

model = tr.AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,
)

# Define params
checkpoint_dir = "checkpoints"  # Directory to save the checkpoints to
run_name = "t5-small-rotten-tomatoes-deepspeed"  # A name for the current training run
checkpoint_path = os.path.join(checkpoint_dir, run_name)

# Define the training arguments
training_args = tr.TrainingArguments(
    checkpoint_path,
    num_train_epochs=epochs,
    
    # per_device_train_batch_size=batch_size,
    per_device_train_batch_size=batch_size * 2,  # Double the batch size

    deepspeed=zero_config,  # Pass in the ZeRO config
    # fp16=True,  # Use FP16 training
    # bf16=True,  # Use BF16 training (requires Ampere or newer GPUs, such as the A100)
)



# Initialize the trainer
data_collator = tr.DataCollatorWithPadding(tokenizer=tokenizer)  # Initialize the data collator (it pads and stacks individual examples into a batch, like a collate_fn in a PyTorch DataLoader)
trainer = tr.Trainer(
    model,
    training_args,
    train_dataset=rotten_tokenized['train'],  # Pass in the tokenized training dataset
    eval_dataset=rotten_tokenized['validation'],  # Pass in the tokenized validation dataset
    data_collator=data_collator,  # Pass in the data collator
    tokenizer=tokenizer,  # Pass in the tokenizer
)

In [17]:
# Only works on a single GPU! Use the deepspeed launcher (CLI) for multi-GPU training
# trainer.train()
# trainer.save_model()

To use DeepSpeed with more than one GPU, we need the `deepspeed` CLI launcher, which is installed along with the DeepSpeed package. The launcher takes care of setting up the environment variables and other things needed for a distributed run. Inside a Jupyter notebook, DeepSpeed currently supports only single-GPU training, so I have put all the relevant code from this notebook into a separate Python script, `deepspeed_demo.py`, which we will now run. The script also switches to the `t5-base` model, which is larger at 220 million parameters.

We can use `deepspeed` to run our training script as follows:

In [18]:
# Run the command in terminal, after restarting this kernel to free up GPU memory
# Change the number of GPUs to the number of GPUs you have
# deepspeed --num_gpus=6 deepspeed_demo.py
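
When a script is started by a distributed launcher such as `deepspeed`, one process is created per GPU, and each can find its position from environment variables such as `LOCAL_RANK` and `WORLD_SIZE`. The sketch below is not taken from `deepspeed_demo.py`; it is a hypothetical helper showing how a script might restrict printing to a single process, with defaults that let it also run outside a launcher.

```python
import os


def process_info(env=None):
    """Read the local rank and world size, defaulting to a single process."""
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return local_rank, world_size


local_rank, world_size = process_info()
if local_rank == 0:
    # Only the first process on each node prints, to avoid duplicated output
    print(f"running with {world_size} process(es)")
```
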

## References
- [Hugging Face DeepSpeed Integration Docs](https://huggingface.co/docs/transformers/main_classes/deepspeed)
- [LLMs with Hugging Face - Databricks Academy | Kaggle](https://www.kaggle.com/code/aliabdin1/llm-04a-fine-tuning-llms)
- [DeepSpeed ZeRO Tutorial](https://www.deepspeed.ai/tutorials/zero/)
- [Hugging Face Transformers](https://huggingface.co/transformers/)
- [Hugging Face Trainer](https://huggingface.co/transformers/main_classes/trainer.html)