# Efficiency, Efficiency, Efficiency

In this exercise, we'll take a crack at fine-tuning our very own LLM—hopefully increasing our efficiency as we go along.

To run these models, we'll need access to an NVIDIA GPU with ~10 GiB of VRAM (or equivalent), so please run this notebook in one of the following environments (sorted by ease-of-use):
* **Local Machine w/ GPU**: If you have your own machine with an appropriate GPU, just run it there.
* **Google Colab**: Import this notebook to Colab, and set your runtime environment to "T4 GPU". Google provides some GPU time for free, but there are random limitations, so you might get demoted to a non-GPU environment at some point. Requires a Google Account.
* **ITU's HPC**: You can set up Jupyter on a GPU node of the HPC and set up port-forwarding to your local machine ([official documentation](http://hpc.itu.dk/software/jupyternotebook/)).

Once your environment is set up, let's get started by installing all the packages we need:

In [1]:
!pip3 install -q -U bitsandbytes  # provides model quantization
!pip3 install -q -U peft  # provides hooks to turn our full model into a PEFT model
!pip3 install -q -U trl  # provides methods to train our model
!pip3 install -q -U accelerate  # provides methods for efficient GPU usage
!pip3 install -q -U datasets  # provides easy access to large datasets
!pip3 install -q -U transformers  # provides ready-to-run implementations of the most popular Transformer-based models

In [2]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Time to Bloom 🌸!

We'll be working with the Bloom series of models from the BigScience project ([BigScience, 2022](https://arxiv.org/abs/2211.05100)). Compared to the models from private companies such as LLaMA from Meta, or Gemma from Google, we know exactly what went into training the Bloom models, and can access them without having to accept these companies' privacy policies.

By default, we'll be using Bloom's 1 billion parameter version, but the exercise should work with either the smaller or the larger version as well, depending on your GPU size. Note that even the relatively "small" 1B version is already almost 100x larger than the BERT-style models, which you might have used in previous courses. More common sizes of what people call LLMs now range from around 7B to 200B parameters, so keep that in mind while we go through our exercises.

Anyway, let's load the model and see what happens:

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# uncomment the model you want to work with:
# MODEL_ID = 'bigscience/bloom-560m'
MODEL_ID = 'bigscience/bloom-1b1'
# MODEL_ID = 'bigscience/bloom-3b'

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


When you load the model for the first time, you might notice that it is loaded in "shards". This is to make file management easier, and is similar to how Transformers can be split up into multiple blocks. It is very common nowadays to have more than a dozen shards per model. After being downloaded, the shards are merged into one model in the machine's working memory.

Let's check out the model architecture:

In [4]:
print(model)

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 1536)
    (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=1536, out_features=4608, bias=True)
          (dense): Linear(in_features=1536, out_features=1536, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=1536, out_features=6144, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=6144, out_features=1536, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
  )
  (

Notes:

- Bloom models have a vocabulary of 250,880 unique tokens, while GPT-3 has around 50,000. This is because Bloom is trained on many languages.
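To get a feeling for what such a large vocabulary costs, here is a small back-of-the-envelope sketch. It only uses the embedding shape from the printed architecture and the total parameter count that `print_model_statistics` reports for `bloom-1b1`; the variable names are made up for illustration.

```python
# Dimensions read off the printed architecture: Embedding(250880, 1536)
VOCAB_SIZE = 250_880
HIDDEN_SIZE = 1_536
# Total parameter count as reported by print_model_statistics for bloom-1b1
TOTAL_PARAMETERS = 1_065_314_304

# Every token in the vocabulary gets its own vector of size HIDDEN_SIZE
embedding_parameters = VOCAB_SIZE * HIDDEN_SIZE
embedding_share = embedding_parameters / TOTAL_PARAMETERS

print(f"Embedding parameters: {embedding_parameters:,}")
print(f"Share of all parameters: {embedding_share:.1%}")
```

In other words, more than a third of this model's parameters sit in the token embeddings alone.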

### 📝 Exercise 1

Looks like a pretty standard Transformer architecture, right? Check your understanding of all components by answering the following questions:

1. What's the difference between `BloomForCausalLM` and `BloomModel`?
`BloomModel` is the base model. `BloomForCausalLM` is built on top of `BloomModel` for autoregressive text generation, i.e., next-token prediction.

2. How large is the vocabulary of this model?
250,880 tokens.

3. How many layers does this model have?
24 blocks in total.

4. What do Bloom's feed-forward layers look like? I.e., are latent vectors up/down-scaled, and which non-linear activation function is used?
`dense_h_to_4h` scales the latent vector up by a factor of four (1536 → 6144), `gelu_impl` applies the GELU activation function, and `dense_4h_to_h` scales it back down (6144 → 1536).

5. All good with some self-attention, and MLPs, but what is all this other stuff: `LayerNorm`, `Dropout`? Do you also need to train these?
LayerNorm: Normalizes its input and has a learnable scale and shift (`elementwise_affine=True`), so it should be trained.
Dropout: Randomly sets activations to zero during training to prevent overfitting. It has no parameters, so there is nothing to train.

6. Bonus: What is happening with the query, key and value matrices in `query_key_value`? Why is this one parameter, and not three?
For efficiency: the three projections are fused into one weight matrix (out_features = 4608 = 3 × 1536), so Q, K, and V are computed with a single matrix multiplication and split up afterwards.
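A toy sketch of the fused-projection idea, in plain Python and with a hidden size of 2 instead of 1536. It only shows that stacking three weight matrices and multiplying once gives the same numbers as three separate multiplications. How Bloom actually arranges the heads inside its fused layer is not reproduced here, and all names and values are made up for illustration.

```python
def matvec(weight, x):
    """Multiply a matrix (given as a list of rows) with a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

# Toy weights for the three projections (hidden size 2)
W_q = [[1, 0], [0, 1]]
W_k = [[2, 0], [0, 2]]
W_v = [[0, 1], [1, 0]]
x = [3, 4]

# Separate projections: three matrix-vector products
q, k, v = matvec(W_q, x), matvec(W_k, x), matvec(W_v, x)

# Fused projection: stack the rows into one (3 * hidden) x hidden matrix,
# multiply once, then slice the result back into three parts
W_qkv = W_q + W_k + W_v
fused = matvec(W_qkv, x)
hidden = len(x)
q_fused = fused[:hidden]
k_fused = fused[hidden:2 * hidden]
v_fused = fused[2 * hidden:]

print(q_fused, k_fused, v_fused)
```

On a GPU, one large matrix multiplication is usually cheaper to launch and schedule than three smaller ones, which is the main appeal of the fused layout.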


Alright, now we got a good look at the model. Let's see how much space it's using up on the GPU:

In [5]:
!nvidia-smi

Thu Dec 26 12:07:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Huh? Considering that we made such a big deal about using GPUs in this exercise, not much is happening on it 🧐

That's because by default, the model is loaded onto the CPU, and standard RAM. You can check where your model lives using the `*.device` property in PyTorch:

In [6]:
print(f"The model lives on the {model.device}.")

The model lives on the cpu.


Nothing against CPUs, but to go fast, we need to move over into CUDA-land on the GPU.

Let's do just that:

In [7]:
model = model.to('cuda')
print(f"Now the model lives on the {model.device} device.")

Now the model lives on the cuda:0 device.


Our model now lives on the 0th CUDA device, i.e., the one true GPU. We can verify this by directly checking the GPU usage:

In [8]:
!nvidia-smi

Thu Dec 26 12:07:42 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0              31W /  70W |   4215MiB / 15360MiB |     33%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Bloom should now be cozily occupying ~4 GiB of VRAM on our GPU. Lots of breathing room. Easy!

To get a better idea of how the memory is being used, let's define a function to print out some stats regarding efficiency:

In [9]:
def print_model_statistics(model):
    num_parameters = sum(p.numel() for p in model.parameters())
    num_trainable_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
    cuda_total_memory = torch.cuda.get_device_properties(0).total_memory
    cuda_alloc_memory = torch.cuda.memory_allocated(0)
    print(f"{model.__class__.__name__} has:")
    print(f"  {num_parameters} total parameters")
    print(f"  {num_trainable_parameters} trainable parameters ({(num_trainable_parameters * 100)/num_parameters:.2f}%)")
    print(f"  {cuda_alloc_memory} / {cuda_total_memory} bytes of VRAM in use ({(cuda_alloc_memory*100)/cuda_total_memory:.2f}%)")

print_model_statistics(model)

BloomForCausalLM has:
  1065314304 total parameters
  1065314304 trainable parameters (100.00%)
  4286423040 / 15835660288 bytes of VRAM in use (27.07%)


This shows us in more detail how all that VRAM is being used. We see that:

* There are indeed one biiillliiiiooon 🤙 total parameters, which we enumerate using `model.parameters()`.
* All of these parameters need to be trained, if we want to fine-tune the model. We count these by checking which of the above parameters require gradient computation, i.e., `p.requires_grad`.
* How much memory PyTorch says it is using. Note that this counts the tensors PyTorch has allocated on the GPU (mainly the model weights). `nvidia-smi` reports a somewhat higher number, since it also includes the CUDA context and other overhead that PyTorch needs to run.
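We can sanity-check these numbers by hand. The sketch below rebuilds the parameter count from the layer shapes in the printed architecture and converts it to bytes. Two assumptions go into it: the weights are stored as 32-bit floats, and the output head shares its weights with the token embedding (so it is not counted twice). The helper names are made up for illustration.

```python
HIDDEN = 1_536
VOCAB = 250_880
NUM_BLOCKS = 24

def linear(in_features, out_features):
    """Parameters of a Linear layer with bias."""
    return in_features * out_features + out_features

def layer_norm(size):
    """Parameters of a LayerNorm with learnable scale and shift."""
    return 2 * size

block = (
    2 * layer_norm(HIDDEN)          # input_layernorm, post_attention_layernorm
    + linear(HIDDEN, 3 * HIDDEN)    # query_key_value
    + linear(HIDDEN, HIDDEN)        # dense
    + linear(HIDDEN, 4 * HIDDEN)    # dense_h_to_4h
    + linear(4 * HIDDEN, HIDDEN)    # dense_4h_to_h
)
total = (
    VOCAB * HIDDEN                  # word_embeddings
    + layer_norm(HIDDEN)            # word_embeddings_layernorm
    + NUM_BLOCKS * block            # the 24 BloomBlocks
    + layer_norm(HIDDEN)            # ln_f
)

BYTES_PER_FP32 = 4  # assumption: weights are stored as 32-bit floats
weight_bytes = total * BYTES_PER_FP32
print(f"{total:,} parameters")
print(f"{weight_bytes:,} bytes ({weight_bytes / 2**30:.2f} GiB) for the weights")
```

The parameter count matches the reported total exactly, and the weights alone account for almost all of the allocated bytes reported above.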

## Let's Get Promptin'

Now that we've gotten to know the model a bit better, let's do the generative AI thing: prompting.

To make things easier, here's a little helper function.

* It converts the input prompt into tokens, converts these tokens to numbers (i.e., token IDs) in the form of PyTorch tensors (`"pt"`), before moving them to the GPU. This is important, since the model lives on the GPU and needs to be able to access and work with this information.
* Given the inputs, it then passes them iteratively through the model to generate one new token at a time. To not wait forever, we limit the number of new tokens to 64 (you can change this to get longer answers, if you'd like).
* The outputs are also token IDs, so in the last step, we convert them back into strings by using the tokenizer's `decode()` method. We skip special tokens, such as beginning of generation, end of generation, etc.

In [10]:
def generate_response(prompt, model):
    inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
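To make the "one new token at a time" part more tangible, here is a toy sketch of such a generation loop. It assumes that the most likely token is picked at every step (greedy selection), and it replaces the real model with a hypothetical lookup table, so none of this is the actual `generate()` implementation.

```python
# Hypothetical stand-in for a language model: maps the last token to the
# single most likely next token. A real model scores the whole vocabulary
# based on all previous tokens.
TOY_NEXT_TOKEN = {
    "It": "is",
    "is": "a",
    "a": "unique",
    "unique": "university",
    "university": ".",
    ".": "It",
}

def greedy_generate(prompt_tokens, max_new_tokens):
    """Append one token at a time, always picking the most likely one."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = TOY_NEXT_TOKEN.get(tokens[-1])
        if next_token is None:  # no known continuation: stop early
            break
        tokens.append(next_token)
    return tokens

generated = greedy_generate(["It"], max_new_tokens=12)
print(" ".join(generated))
```

Because the selection is deterministic, the toy model ends up walking the same cycle over and over. Keep an eye out for similar loops in the real model's answers.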

Alright, let's see what our model has to say!

*(Generation is compute intensive, and may take around 5–30 seconds.)*

In [11]:
generate_response("Tell me about the IT University of Copenhagen.", model)

Tell me about the IT University of Copenhagen. What is it about?
The IT University of Copenhagen is a unique university in Denmark. It is a unique university in the world. It is a unique university in Europe. It is a unique university in the world. It is a unique university in Europe. It is a unique university in the world. It is a unique


Bloom! 💥 Now we're talking!

Although the response sounds a little clunky, there's definitely... potential. Let's figure out what this model can do!

### 📝 Exercise 2

1. Try out if the model responds correctly to the prompt, "What is the capital of Denmark?".
2. The knowledge must be in there somewhere. Try to find a prompt, which results in the correct answer.
3. Why might one prompt work, while the other does not?
4. How good is the model's Danish?
5. Bonus: Knock yourselves out and try some more prompts. Note down what the model might be better/worse at.

In [12]:
#Q1
generate_response("What is the capital of Denmark?", model)

What is the capital of Denmark? Denmark is a country in the Baltic Sea, which is located in the northwestern part of Europe. Denmark is a country with a population of about 1.5 million people. Denmark is a country with a population of about 1.5 million people. Denmark is a country with a population of about 1.5 million people


In [13]:
#Q2
generate_response("Denmark the country has a capital named?", model)

Denmark the country has a capital named? The capital of Denmark is Copenhagen. The capital of Denmark is Copenhagen. The capital of Denmark is Copenhagen. The capital of Denmark is Copenhagen. The capital of Denmark is Copenhagen. The capital of Denmark is Copenhagen. The capital of Denmark is Copenhagen. The capital of Denmark is Copenhagen. The capital of Denmark is Copenhagen. The


In [1]:
#Q3
# The first is more open-ended, while the other is better suited for next-token prediction (predicting the probability of the next word)

In [15]:
#Q4
generate_response("Den danske hovedstad hedder?", model)

Den danske hovedstad hedder?", "Quel est le quartier le plus cher de Copenhague?", "Quel est le quartier le plus cher de Copenhague?", "Quel est le quartier le plus cher de Copenhague?", "Quel est le quartier le plus cher de Copenhague?", "Quel est le quartier le plus cher de Copenhague


In [16]:
# The model continues in French instead of Danish. Overall, its output seems to vary a lot depending on how you frame your prompt

## No Train; No Gain

The model obviously knows something, but it's hard to get useful answers from it. Actually, all LLMs start out this way, and they need additional training in order to follow user instructions.

### Extractive Question Answering

Normally, we train autocomplete-style LLMs to become chat-style LLMs by applying instruction-tuning, multi-turn dialogue training, and alignment with human feedback. Unfortunately, we ain't got time for that (today). So, we're going to do something that is equivalent in terms of method and format, but slightly smaller in scope: *extractive question answering*.

In essence, instead of training the LLM to answer a question directly (because we would be training for days on end), we will ask a question, and provide a context, which contains the answer. This way, the model has a better chance of finding the correct answer.

To train the model, we first need the relevant training data, and luckily, we have the SQuAD v2 dataset ([Rajpurkar et al., 2018](https://aclanthology.org/P18-2124/)), which is publicly available on HuggingFace datasets.

In [17]:
from datasets import load_dataset

squad = load_dataset('squad_v2', split='train')
print(f"Loaded SQuAD v2 dataset:\n{squad}")

Loaded SQuAD v2 dataset:
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 130319
})


130,319 rows of training data goodness (note that we only load the train split). This should be enough to train our LLM, and we'll actually only need to use part of it. While this might seem like a lot, full instruction tuning datasets contain *millions* of rows. Quite the high barrier to entry for new languages, isn't it?

Now, what's actually in this dataset?

In [18]:
squad[238]

{'id': '56be9add3aeaaa14008c9152',
 'title': 'Beyoncé',
 'context': 'Her fourth studio album 4 was released on June 28, 2011 in the US. 4 sold 310,000 copies in its first week and debuted atop the Billboard 200 chart, giving Beyoncé her fourth consecutive number-one album in the US. The album was preceded by two of its singles "Run the World (Girls)" and "Best Thing I Never Had", which both attained moderate success. The fourth single "Love on Top" was a commercial success in the US. 4 also produced four other singles; "Party", "Countdown", "I Care" and "End of Time". "Eat, Play, Love", a cover story written by Beyoncé for Essence that detailed her 2010 career break, won her a writing award from the New York Association of Black Journalists. In late 2011, she took the stage at New York\'s Roseland Ballroom for four nights of special performances: the 4 Intimate Nights with Beyoncé concerts saw the performance of her 4 album to a standing room only.',
 'question': "Beyonce's fourth albu

The Queen herself! The format is pretty straightforward, but it's worth reflecting on how it came to 🐝. SQuAD v1 and v2 are based on Wikipedia, so:

* `title` is the title of the article;
* `context` is a paragraph from the article;
* `question` asks for information related to the article;
* `answers` highlights one or more passages in the context, which contain the answer.

Another cool thing about the second version of SQuAD in particular is the following:

In [19]:
squad[2075]

{'id': '5a8d7bf7df8bba001a0f9ab1',
 'title': 'The_Legend_of_Zelda:_Twilight_Princess',
 'context': 'The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006.[b]',
 'question': 'What category of game is Legend of Zelda: Australia Twilight?',
 'answers': {'text': [], 'answer_start': []}}

This question is related to the Legend of Zelda, but the answer is not in the context (because the name of the game is wrong). In these cases there are no answer segments. This is to test whether the model knows when it cannot know.
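Since unanswerable rows are simply the ones with an empty answer list, they are easy to pick out. Below is a minimal sketch with two hand-written toy rows in the same shape as the examples above (their contents are made up for illustration).

```python
def is_answerable(row):
    """A row is answerable if it has at least one annotated answer span."""
    return len(row["answers"]["text"]) > 0

# Two hypothetical rows that mimic the structure of SQuAD v2 entries
toy_rows = [
    {
        "question": "A toy question with an answer in the context",
        "answers": {"text": ["toy answer"], "answer_start": [0]},
    },
    {
        "question": "A toy question without an answer in the context",
        "answers": {"text": [], "answer_start": []},
    },
]

num_answerable = sum(is_answerable(row) for row in toy_rows)
print(f"{num_answerable} of {len(toy_rows)} rows are answerable")
```

The same function should also work on the real data, e.g., `sum(is_answerable(row) for row in squad)`, although iterating over all rows takes a moment.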

### 📝 Exercise 3

Although it might look like natural language magic, LLMs are often trained to respond to instructions written in a specific format. For example, the popular Alpaca format is:

```
### Instruction:
{instruction}

### Input:
{input}

### Response:
{response}
```

When providing a prompt through a user interface, it is typically reformatted into the format the LLM is used to before being passed to the model. If this is not done, responses may be worse or unpredictable.

Let's come up with a good format for our extractive question answering task, by completing the formatting function below. Some pointers:
* As input, this function takes one dataset row at a time (what you saw above by accessing `squad[i]`).
* The formatted prompt should include all relevant information, i.e., context, question, and answer.
* The last part should be the answer, such that the model can generate the answer to a new question, by continuing the provided input.
* The parts should be easy to tell apart, using separators which are unlikely to occur in the input text.
* No need to make the format overly complex, as it takes longer to generate (and train) on longer inputs.
* Note that there should also be a default answer for when there is no answer (e.g., "No idea, sorry :(").

In [20]:
def format_instruction(row):

    # Fall back to a default string if a field is missing from the row
    context_ = row.get("context", 'No context provided')
    question_ = row.get("question", 'No question provided')

    answer_text = row.get("answers", {}).get("text", [])
    if not answer_text:
        answer_ = "No answer provided"
    else:
        answer_ = answer_text[0]

    prompt = (
        f"### Context: {context_} "
        f"### Question: {question_} "
        f"### Answer: {answer_} "
    )

    return prompt

In [21]:
format_instruction(squad[239])

'### Context: Her fourth studio album 4 was released on June 28, 2011 in the US. 4 sold 310,000 copies in its first week and debuted atop the Billboard 200 chart, giving Beyoncé her fourth consecutive number-one album in the US. The album was preceded by two of its singles "Run the World (Girls)" and "Best Thing I Never Had", which both attained moderate success. The fourth single "Love on Top" was a commercial success in the US. 4 also produced four other singles; "Party", "Countdown", "I Care" and "End of Time". "Eat, Play, Love", a cover story written by Beyoncé for Essence that detailed her 2010 career break, won her a writing award from the New York Association of Black Journalists. In late 2011, she took the stage at New York\'s Roseland Ballroom for four nights of special performances: the 4 Intimate Nights with Beyoncé concerts saw the performance of her 4 album to a standing room only. ### Question: Which single had the most success from that album? ### Answer: Love on Top '
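At inference time, we don't know the answer yet, so the prompt has to stop right after the answer marker and let the model continue from there. The sketch below mirrors the separators used in the formatting function above; `format_prompt` and the example context are hypothetical and not part of the original exercise.

```python
def format_prompt(context, question):
    """Build the prompt up to and including the answer marker."""
    return f"### Context: {context} ### Question: {question} ### Answer:"

prompt = format_prompt(
    context="Copenhagen is the capital of Denmark.",
    question="What is the capital of Denmark?",
)
print(prompt)
```

After fine-tuning, a prompt like this could be passed to `generate_response(prompt, model)`, and everything the model generates after `### Answer:` would be its answer.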

## Full Fine-tuning

We have our model; we have our data; we have our task formulation. Time to get down to business!

Since the training objective itself is relatively simple, i.e., generating the next token (of the answer) based on the input context and question, we don't need to implement any complex training loops, and can make use of the standard training helpers from HuggingFace. These are the Supervised Fine-tuning Trainer (`SFTTrainer`) and its accompanying configuration.

In [22]:
from trl import SFTConfig, SFTTrainer

trainer_config = SFTConfig(
    output_dir='outputs',
    learning_rate=5e-7,
    lr_scheduler_type='constant',
    max_grad_norm=0.3,
    weight_decay=0.001,
    max_steps=100,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    max_seq_length=1024,
    packing=True,
    fp16=True,
    logging_steps=1,
    seed=42
)

### 📝 Exercise 4

We've set the hyperparameters to reasonable values, but make sure you understand what's happening here. The [official documentation](https://huggingface.co/docs/trl/main/en/sft_trainer#trl.SFTConfig) provides answers to some of these.

1. The learning rate is set to 5x10^-7. Does this seem low or high to you? What might the reasons be?

2. What do `per_device_train_batch_size` and `gradient_accumulation_steps` refer to? How many data points are loaded to the GPU, and how many account for one model weight update?

3. How does the `max_seq_length` influence the efficiency of model training?

4. What does the `fp16` flag change compared to the default training configuration?

5. Bonus: What influence does the learning rate schedule have on model training, and what can we tweak here?

1. The default is 2x10^-5, so this is low. I think it is to prevent overfitting, as the model weights will not be influenced too much in each iteration.
2. `per_device_train_batch_size` is the number of data points loaded onto the GPU for each forward and backward pass. `gradient_accumulation_steps` is how many forward and backward passes occur before the accumulated gradients are used to update the model.
- 1 data point is loaded at a time. After processing 16 data points, the model updates its weights (so 16 data points account for one weight update).

3. Inputs longer than 1024 tokens are cut off, which saves computation and memory, at the cost of losing potentially important information.
- Inputs shorter than 1024 tokens would normally be filled up with padding tokens, which wastes computation. With `packing=True`, short examples are instead concatenated until the sequence is full.

4. It enables 16-bit floating-point (mixed) precision training instead of pure 32-bit training, which reduces memory use and makes training more efficient.

5. 
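The numbers from the configuration above can be combined into a quick estimate of how much data goes into each weight update. This is a sketch under two assumptions: we train on a single GPU, and packing fills every sequence up to `max_seq_length`.

```python
# Values copied from the SFTConfig above
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
max_seq_length = 1024
max_steps = 100
num_devices = 1  # assumption: a single GPU

# Sequences on the GPU at once vs. sequences per weight update
sequences_per_forward_pass = per_device_train_batch_size
sequences_per_update = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# With packing, every sequence is assumed to be filled up completely
tokens_per_update = sequences_per_update * max_seq_length
total_tokens = tokens_per_update * max_steps

print(f"{sequences_per_forward_pass} sequence(s) on the GPU at a time")
print(f"{sequences_per_update} sequences per weight update")
print(f"{tokens_per_update:,} tokens per weight update")
print(f"{total_tokens:,} tokens over {max_steps} steps")
```

Gradient accumulation trades time for memory: only one sequence has to fit on the GPU at once, but each update still averages over 16 of them.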

Alright, now let's put everything together:

In [23]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=squad,
    formatting_func=format_instruction,
    args=trainer_config
)



In [26]:
import os
os.environ["WANDB_DISABLED"] = "true"

As you can see (you can ignore the warnings), the trainer already pre-generates the training split by formatting the SQuAD dataset using our formatting function. Now we can get training by simply calling `trainer.train()`. Let's go! 🔥

In [27]:
trainer.train()

ValueError: Got unexpected arguments: {'num_items_in_batch': 16384}

### 📝 Exercise 5

Oh no! The dreaded `OutOfMemoryError: CUDA out of memory.`! Why does this happen? Let's check the state of our GPU:

In [None]:
!nvidia-smi

Hmm, curious... looks like the GPU is indeed full. Let's clear out the trash from the GPU's VRAM for a second. Unfortunately, Jupyter doesn't make this easy, so please comment out the `trainer.train()` command above and `Restart Kernel and Run Up to Selected Cell...`.

While the notebook is re-doing its thing, let's think about why this may be happening:

**Question**: Why does the GPU fill up, although we could easily run inference on it just a moment before?

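**Answer**: Inference only needs the weights and the activations of the current forward pass. Training needs memory for both the forward and the backward pass: the activations have to be kept around, and a gradient has to be stored for every trainable parameter (plus the optimizer's own state). I read somewhere that training needs around 2-4 times the memory of inference.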
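A rough estimate makes the problem concrete. The sketch below assumes that weights, gradients, and the optimizer state are all kept as 32-bit floats, and that an Adam-style optimizer stores two extra values per parameter. Activations are ignored entirely, so the real requirement is higher still.

```python
NUM_PARAMETERS = 1_065_314_304   # from print_model_statistics
GPU_BYTES = 15_835_660_288       # total VRAM, from print_model_statistics

# Assumption: everything is kept in 32-bit floats (4 bytes per value)
BYTES_PER_PARAMETER = {
    "weights": 4,
    "gradients": 4,
    "Adam first moment": 4,
    "Adam second moment": 4,
}

total_bytes = 0
for name, size in BYTES_PER_PARAMETER.items():
    bytes_needed = NUM_PARAMETERS * size
    total_bytes += bytes_needed
    print(f"{name:<28} {bytes_needed / 2**30:6.2f} GiB")

print(f"{'total (without activations)':<28} {total_bytes / 2**30:6.2f} GiB")
print(f"fits on the GPU: {total_bytes <= GPU_BYTES}")
```

Under these assumptions, full fine-tuning of the 1B model already needs more memory than the T4 offers, before a single activation is stored.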

## Parameter-efficient Fine-tuning (PEFT)

Hopefully, the GPU should be back at ~4 GiB. Turns out, training LLMs requires lots of VRAM. We're now faced with a few options:
* Just buy a larger GPU → Not with this salary.
* Just get Colab Pro → Not if I can spend the money on Analog.
* Just use a larger GPU in the HPC → Not with all these other people running around.
* Just try this PEFT thing I learned about in the last lecture → We might be onto something here!

Today, we'll be working with Low-Rank Adaptation (LoRA; [Hu et al., 2022](https://openreview.net/forum?id=nZeVKeeFYf9)). Mostly because it is relatively easy to implement, understand, and because it's my favourite ⭐️. Let's set up our LoRA configuration:

In [28]:
from peft import get_peft_model, LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=['query_key_value', 'dense_h_to_4h', 'dense_4h_to_h'],
    task_type='CAUSAL_LM',
)

The most important arguments here are the rank `r` of the LoRA matrices A and B, as well as the list of modules we want to adapt. Here we're using rank 8 and are adapting most of the linear layers in each block, i.e., the fused Q, K, V projection in the attention mechanism + the up/down projections in the MLP (only the attention output projection `dense` is left untouched). The task type tells the PEFT library to pre-configure the remaining hyperparameters to something suitable for the `CAUSAL_LM` task.
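How many new parameters does this configuration add? A minimal sketch, assuming each adapted layer gets one matrix A of shape `r × in_features` and one matrix B of shape `out_features × r`, both without bias. The layer shapes are read off the printed architecture; the helper name is made up.

```python
RANK = 8
NUM_BLOCKS = 24
BASE_PARAMETERS = 1_065_314_304  # from print_model_statistics

# (in_features, out_features) of the adapted layers in each block
TARGET_LAYERS = {
    "query_key_value": (1536, 4608),
    "dense_h_to_4h": (1536, 6144),
    "dense_4h_to_h": (6144, 1536),
}

def lora_parameters(in_features, out_features, rank):
    """Parameters of A (rank x in_features) plus B (out_features x rank)."""
    return rank * in_features + out_features * rank

per_block = sum(
    lora_parameters(in_f, out_f, RANK) for in_f, out_f in TARGET_LAYERS.values()
)
total_lora = per_block * NUM_BLOCKS
trainable_share = total_lora / (BASE_PARAMETERS + total_lora)

print(f"LoRA parameters per block: {per_block:,}")
print(f"LoRA parameters in total:  {total_lora:,}")
print(f"Share of all parameters:   {trainable_share:.2%}")
```

Doubling the rank doubles the number of LoRA parameters, so `r` is a direct knob for trading capacity against memory.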

Let's apply this configuration to our model:

In [29]:
model = get_peft_model(model, lora_config)
print(model)
print_model_statistics(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): BloomForCausalLM(
      (transformer): BloomModel(
        (word_embeddings): Embedding(250880, 1536)
        (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (h): ModuleList(
          (0-23): 24 x BloomBlock(
            (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
            (self_attention): BloomAttention(
              (query_key_value): lora.Linear(
                (base_layer): Linear(in_features=1536, out_features=4608, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1536, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4608, bias=False)
                )
                (lora_embedding_A): Paramet

### 📝 Exercise 6

One function call, but a lot has changed. Let's walk through it:

1. Which parts of the model has LoRA affected, and through which new parameters does this manifest?
2. How have the number of total and trainable parameters, as well as the memory usage changed with respect to the original model?
3. Which parameters make up the smaller fraction of trainable parameters?
4. How will gradients be computed once we start training, and how will this affect memory usage?

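1. The fused query-key-value projection in the self-attention, as well as the two MLP projections (`dense_h_to_4h` and `dense_4h_to_h`). Each of them is wrapped in a `lora.Linear`, which adds a `lora_A` (down-projection) and a `lora_B` (up-projection) matrix.
2. The total number of parameters increased by about 4 million (the LoRA parameters), and the memory usage increased slightly as well. The number of trainable parameters dropped from 100% to a fraction of a percent.
3. I think they are `lora_A` and `lora_B`. Only these parameters are trained.
4. Memory usage will of course increase compared to inference, but by a lot less than before, since gradients only need to be stored for the ~4 million trainable parameters.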
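For intuition, here is a toy sketch of what a LoRA-adapted linear layer computes: the frozen weight `W` is applied as usual, and the low-rank product of `B` and `A` is added on top, scaled by `alpha / rank`. The sizes and values are made up (3 dimensions, rank 1), and the zero initialisation of `B` follows the LoRA paper. This is an illustration, not the PEFT library's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) with a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank path: down, then up
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [3.0, 0.0, 1.0]]  # frozen 3x3 weight
A = [[0.5, -0.5, 1.0]]                                   # 1x3 down-projection
B_initial = [[0.0], [0.0], [0.0]]                        # 3x1 up-projection, zeros
B_trained = [[1.0], [0.0], [-1.0]]                       # pretend training changed B
x = [1.0, 2.0, 3.0]

y_before = lora_forward(x, W, A, B_initial, alpha=1, rank=1)
y_after = lora_forward(x, W, A, B_trained, alpha=1, rank=1)
print(y_before)
print(y_after)
```

With `B` at zero, the adapted layer behaves exactly like the original one, so training starts from the unchanged pre-trained model.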

A much smaller training footprint—just like we wanted. Let's update our trainer with our `peft_config` and try again. Note that the learning rate is set higher now, since we're only interested in updating the relatively small LoRA weights and don't need to worry so much about messing up the rest of the model.

In [30]:
trainer_config = SFTConfig(
    output_dir='outputs/lora',
    learning_rate=2e-4,
    lr_scheduler_type='constant',
    max_grad_norm=0.3,
    weight_decay=0.001,
    max_steps=10,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    max_seq_length=512,
    packing=True,
    fp16=True,
    logging_steps=1,
    seed=42
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=squad,
    formatting_func=format_instruction,
    args=trainer_config,
    peft_config=lora_config
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [31]:

from trl import SFTConfig, SFTTrainer

# ... other code ...

def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
    """
    How the loss is computed by Trainer. By default, all models return the loss in the first element.
    Subclass and override for custom behavior.
    """
    if self.label_smoother is not None and "labels" in inputs:
        labels = inputs.pop("labels")
    else:
        labels = None

    # Remove num_items_in_batch if present in inputs
    if "num_items_in_batch" in inputs:
        del inputs["num_items_in_batch"]

    outputs = model(**inputs)
    # Save past state if it exists
    # TODO: this needs to be fixed and made cleaner later.
    if self.args.past_index >= 0:
        self._past = outputs[self.args.past_index]

    if labels is not None:
        loss = self.label_smoother(outputs, labels)
    else:
        # We don't use .loss here since the model may return tuples instead of ModelOutput.
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

    return (loss, outputs) if return_outputs else loss

# Monkey-patch the compute_loss function of the SFTTrainer class
SFTTrainer.compute_loss = compute_loss

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=squad,
    formatting_func=format_instruction,
    args=trainer_config,
    peft_config=lora_config
)

# Now you can call trainer.train()
trainer.train()



Step,Training Loss
1,49.1988
2,51.005
3,50.406
4,48.7936
5,49.6337
6,50.1828
7,48.5993
8,47.5753
9,49.0375
10,47.5138


TrainOutput(global_step=10, training_loss=49.19459762573242, metrics={'train_runtime': 51.6129, 'train_samples_per_second': 3.1, 'train_steps_per_second': 0.194, 'total_flos': 336244600995840.0, 'train_loss': 49.19459762573242, 'epoch': 0.0034099869994245646})

This time, we'll be a bit more careful and try a small test run of 10 training steps.

Alright, let's take another crack at this! 🔥

Looking good! We are now actually able to train an LLM using PEFT 🎉

However... training like this on the full dataset will take quite some time. One way to speed things up would be to increase the batch size, i.e., the number of data points for which we compute gradients at the same time. But let me tell you now: unless you have a GPU twice as large, we will run out of memory again.

So the question remains: *Can we make this... even more efficient?*

Before we can tackle this question, we, once again, need to clear out Jupyter's GPU usage. Run the cell below, and verify that your VRAM use is below ~500MiB:

In [32]:
import gc
model, trainer = None, None
gc.collect()
torch.cuda.empty_cache()
!nvidia-smi

Thu Dec 26 12:14:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   76C    P0              36W /  70W |   4343MiB / 15360MiB |     25%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Quantized Parameter-efficient Fine-tuning

This brings us to the state of the art in efficient LLM training: *Quantized* PEFT. This means that we slice the last few bits off the original model weights to save memory. While we lose some precision (and model performance), this approach has been shown to be relatively stable, and offers a good trade-off between efficiency and performance.
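
To get a feel for what cutting weights down to 4 bits costs in precision, here is a minimal sketch on a handful of made-up weights. It uses plain absmax rounding to integers in [-7, 7], which is a simplification: the `nf4` type used in the next cell relies on a non-uniform codebook and per-block scales instead. The function names and the example weights are illustrative assumptions, not part of bitsandbytes.

```python
def quantize_absmax_4bit(weights):
    """Map floats to signed 4-bit integer codes in [-7, 7] with one shared scale."""
    # The largest absolute weight maps to +/-7; fall back to 1.0 if all weights are 0.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


# Made-up example weights (assumption, for illustration only)
weights = [0.70, -0.33, 0.10, 0.02, -0.69]

codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("restored:", [round(r, 2) for r in restored])
print(f"max error: {max_error:.3f} (bounded by scale / 2 = {scale / 2:.3f})")
```

Each weight now needs only 4 bits plus a shared scale, and the rounding error per weight stays below half a quantization step. Small weights such as `0.02` collapse to zero, which is the kind of precision loss the paragraph above refers to.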

Luckily, HuggingFace makes it easy to quantize your model. Let's quantize it and take a look:

In [33]:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, quantization_config=bnb_config)
print(model)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 1536)
    (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear4bit(in_features=1536, out_features=4608, bias=True)
          (dense): Linear4bit(in_features=1536, out_features=1536, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear4bit(in_features=1536, out_features=6144, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear4bit(in_features=6144, out_features=1536, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((1536,), eps=1e-05, elementwise_affi

The model keeps its original overall architecture, but, similar to what we saw with LoRA, the original linear layers have been replaced by `Linear4bit` versions. Next, let's add the LoRA modules back in:

In [34]:
model = get_peft_model(model, lora_config)
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): BloomForCausalLM(
      (transformer): BloomModel(
        (word_embeddings): Embedding(250880, 1536)
        (word_embeddings_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (h): ModuleList(
          (0-23): 24 x BloomBlock(
            (input_layernorm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
            (self_attention): BloomAttention(
              (query_key_value): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=1536, out_features=4608, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1536, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4608, bias=False)
                )
                (lora_embedding_A):

Now we have quantization *and* LoRA for an unholy, but efficient mess of a model!
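
As a rough sanity check on how small the trainable part is, we can count the adapter weights from the shapes in the printout above. This sketch only counts the `query_key_value` adapters that are visible in the (truncated) output; if `lora_config` targets further modules, the real number is higher.

```python
# Shapes read off the printed model (query_key_value adapters only)
HIDDEN = 1536    # in_features of query_key_value
QKV_OUT = 4608   # out_features of query_key_value
RANK = 8         # out_features of lora_A / in_features of lora_B
LAYERS = 24      # number of BloomBlocks

lora_a = HIDDEN * RANK           # lora_A: 1536 -> 8
lora_b = RANK * QKV_OUT          # lora_B: 8 -> 4608
lora_per_layer = lora_a + lora_b
lora_total = LAYERS * lora_per_layer

frozen_qkv_total = LAYERS * HIDDEN * QKV_OUT  # frozen base weights, biases ignored

print(f"LoRA weights per layer: {lora_per_layer:,}")
print(f"LoRA weights in total:  {lora_total:,}")
print(f"Share of the frozen query_key_value weights: {lora_total / frozen_qkv_total:.2%}")
```

For the exact count on your own model, PEFT models also offer `model.print_trainable_parameters()`.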

Let's check what this means for memory usage:

In [35]:
!nvidia-smi

Thu Dec 26 12:14:58 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0              34W /  70W |   5455MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Wow! That's like a fourth of what we were using before! Almost as if we turned 16-bit floating-point numbers into 4-bit ones!
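
Where does a factor of four come from? Here is a back-of-the-envelope sketch that uses only the layer shapes printed above. It ignores biases, LayerNorms, quantization constants, activations, and the CUDA context, so it will not match `nvidia-smi` exactly. Note also that the printout still shows a regular `Embedding`, so the embedding matrix is counted separately and not as 4-bit.

```python
HIDDEN = 1536   # hidden size from the printout
LAYERS = 24     # number of BloomBlocks
VOCAB = 250880  # rows of the word embedding matrix

# Weights of the four Linear4bit layers in one BloomBlock
linear_weights_per_block = (
    HIDDEN * 3 * HIDDEN      # query_key_value: 1536 -> 4608
    + HIDDEN * HIDDEN        # dense:           1536 -> 1536
    + HIDDEN * 4 * HIDDEN    # dense_h_to_4h:   1536 -> 6144
    + 4 * HIDDEN * HIDDEN    # dense_4h_to_h:   6144 -> 1536
)
linear_weights = LAYERS * linear_weights_per_block
embedding_weights = VOCAB * HIDDEN  # not replaced by a 4-bit layer


def mib(num_weights, bits_per_weight):
    """Storage for num_weights values at the given bit width, in MiB."""
    return num_weights * bits_per_weight / 8 / 2**20


print(f"Linear weights:    {linear_weights:,}")
print(f"  at 16 bits:      {mib(linear_weights, 16):.0f} MiB")
print(f"  at  4 bits:      {mib(linear_weights, 4):.0f} MiB")
print(f"Embedding weights: {embedding_weights:,} (kept as a regular Embedding)")
```

The linear layers shrink by exactly a factor of four, while the sizeable embedding matrix does not, which is one reason the total saving on the GPU is smaller than the bit widths alone suggest.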

Since the quantized model and the data fed through it take up much less memory now, we can increase the amount of training data we can fit onto the GPU by a factor of two. I.e., we can go from batch size 1 to 2 (!!!).

In [36]:
trainer_config = SFTConfig(
    output_dir='outputs/qlora',
    learning_rate=2e-4,
    lr_scheduler_type='constant',
    max_grad_norm=0.3,
    weight_decay=0.001,
    max_steps=100,  # training with 10x more steps than the test run
    per_device_train_batch_size=2,  # !!!!!!!!!!!!!!!
    gradient_accumulation_steps=16,
    max_seq_length=512,
    packing=True,
    fp16=True,
    logging_steps=1,
    save_steps=20,
    seed=42
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=squad,
    formatting_func=format_instruction,
    args=trainer_config,
    peft_config=lora_config
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = SFTTrainer(


With all of these efficiency improvements in place, let's run our training for real, with 100 training steps. This should take around 15 to 20 minutes on a T4, so feel free to go grab a coffee ☕️🔥

In [37]:
trainer.train()

Step,Training Loss
1,50.772
2,50.6286
3,51.1824
4,49.6583
5,50.1441
6,48.0718
7,49.7262
8,48.4126
9,48.927
10,47.7989


TrainOutput(global_step=100, training_loss=45.897761306762696, metrics={'train_runtime': 1118.438, 'train_samples_per_second': 2.861, 'train_steps_per_second': 0.089, 'total_flos': 6724892019916800.0, 'train_loss': 45.897761306762696, 'epoch': 0.06819828651805124})
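
The throughput numbers in the `TrainOutput` above can be reproduced from the training configuration. The sketch below assumes a single GPU and takes the runtime from the logged run; because `packing=True`, one "sample" here is one packed sequence of up to 512 tokens rather than one SQuAD example.

```python
# Values from the SFTConfig and the logged TrainOutput above
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
max_steps = 100
train_runtime_s = 1118.438

# Sequences that contribute to one optimizer step (single GPU assumed)
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps
total_sequences = sequences_per_step * max_steps

samples_per_second = total_sequences / train_runtime_s
steps_per_second = max_steps / train_runtime_s

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"sequences seen in total:      {total_sequences}")
print(f"samples per second:           {samples_per_second:.3f}")
print(f"steps per second:             {steps_per_second:.3f}")
```

Gradient accumulation is what lets each optimizer step see 32 sequences even though only 2 fit on the GPU at once.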

### 📝 Exercise 7

Welcome back! Hope you enjoyed your coffee.

While the model is still training, let's review what it took us to get here:

1. What is it that makes a large language model "large"?
2. Why is memory usage so different during inference versus training?
3. Why is idle memory usage with LoRA higher than without, and why does it nonetheless allow us to train our model?

**Once the model has completed training**, let's take it out for a spin!

Try prompting it with the following context and question using your own instruction format:

In [43]:
context = "The IT University of Copenhagen is the best university"
question = "What university is in copenhagen"

# TODO: adapt to your format:
prompt = f"{context} {question}"

generate_response(prompt, model)

The IT University of Copenhagen is the best university What university is in copenhagen? ### Answer: IT University of Copenhagen 


**Exercise 7 (continued)**

Hopefully, your model provided the correct answer to the question above 🤞 Let's test out its limitations by prompting with increasingly complex questions:

5. Try making the context longer, and see whether the model can still extract the answer.
6. Try adding linguistic ambiguity or coreferences to make it harder to extract the correct answer.
7. Ask some questions which are grammatically correct, but make less real-world sense.
8. Check if the model knows when not to respond, i.e., when the answer is not in the context.

In [46]:
# TODO: Your prompts go here...

context = "Denmark is a country"
question = "What is Denmark"

# TODO: adapt to your format:
prompt = f"{context} {question}"

generate_response(prompt, model)


Denmark is a country What is Denmark's capital? ### Answer: Copenhagen 


**Exercise 7 (continued)**

The neat thing about LoRA is that, as long as you don't bake it into the model by summing the adaptation with the original weights, you can turn it on and off. Essentially, we can bypass LoRA to see what our model does before and after training.
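
Why this works can be shown with a toy layer in plain Python. The frozen weight `W` and the low-rank update `B @ A` are kept separate, so skipping the update gives back the original model, while "baking in" means adding the two into a single matrix. All matrices here are made up, and the scaling factor and dropout that real LoRA layers apply are left out.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights (2x2)
A = [[0.5, 0.5]]              # lora_A: projects 2 -> rank 1
B = [[1.0], [-1.0]]           # lora_B: projects rank 1 -> 2


def forward(x, adapters_enabled=True):
    """Base output W x, plus the low-rank update B (A x) if adapters are on."""
    y = matvec(W, x)
    if adapters_enabled:
        delta = matvec(B, matvec(A, x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y


# "Baking in" the adapter: W_merged = W + B @ A
W_merged = [
    [W[i][j] + sum(B[i][k] * A[k][j] for k in range(len(A))) for j in range(2)]
    for i in range(2)
]

x = [2.0, 4.0]
print("adapters off:", forward(x, adapters_enabled=False))
print("adapters on: ", forward(x, adapters_enabled=True))
print("merged:      ", matvec(W_merged, x))
```

The merged matrix gives the same output as the enabled adapter, but once `W` has been overwritten with it, the original behaviour can no longer be recovered by flipping a switch.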

Using the function below, try out what the model generates before and after adaptation. What do you observe?

In [41]:
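def generate_response_nolora(prompt, model):
    # Temporarily bypass the LoRA adapters to see the base model's output
    model.disable_adapter_layers()
    try:
        return generate_response(prompt, model)
    finally:
        # Always switch the adapters back on, even if generation fails
        model.enable_adapter_layers()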

In [42]:
# TODO: Your prompts go here...
context = "The IT University of Copenhagen is the best university"
question = "What is the best university?"

# TODO: adapt to your format:
prompt = f"{context} {question}"

generate_response(prompt, model)

The IT University of Copenhagen is the best university What is the best university? ### Answer: No answer provided 


Seems like we'd better keep those adaptations enabled, eh? 😅

Feel free to try out more prompts below. In any case, I hope you enjoyed training your very own LLM (make sure to give it a catchy name), and good luck with the rest of the course!

*—Max*