<a href="https://colab.research.google.com/github/dounia-bnk/Active-Learning/blob/main/Fine_tuning_llama2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Installing and importing libraries

In [1]:
!pip install -q accelerate peft bitsandbytes transformers trl

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m296.4/296.4 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m137.5/137.5 MB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m280.1/280.1 kB[0m [31m17.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m105.7/105.7 kB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m527.3/527.3 kB[0m [31m29.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m39.9/39.9 MB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
import os
import torch
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
from transformers import(
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

## Processing Dataset for fine tuning llama 2
This fine tuning task will be done with the use of the No_Robot dataset:
https://huggingface.co/datasets/HuggingFaceH4/no_robots

The No_Robots dataset is a collection of human-written text that excludes any references to robots or artificial intelligence.


LLama 2 needs to follow a specific pattern of instructions to be able to output correct responses.

The goal of this project is to train llama to output more human like responses.

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```



In [3]:
import pandas as pd

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/HuggingFaceH4/no_robots/" + splits["train"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
sampled_df = df.sample(n=1000, random_state=42).reset_index(drop=True)

In [5]:
sampled_df['messages'][0]

array([{'content': "Below is a book review. Classify the book into one of the following genres based on the review, and explain why: Autobiography, Fantasy, History, Mystery\n\nVery unusual world. An island of predator species, not just mammals and birds and reptiles but also carnivorous plants - and more, which I won't give away. Not for squeamish readers, but there is an engaging hero and lots of other characters who also attract the reader's sympathies. I would not recommend it to very young readers despite its young protagonists because it has a 'Nature red in tooth and claw' matter-of-factness that I myself found upsetting.\nI gave it 5 stars because the story ends in a satisfactory way and doesn't require a sequel, though the author mentioned she was writing more stories about these characters.", 'role': 'user'},
       {'content': 'Based on the review, this book belongs to the Fantasy genre as it seems to be set in an otherworld with fictional predators.', 'role': 'assistant'}],

In [6]:
def format_interaction(interaction):
    formatted_interaction = []
    system_prompt = ""

    for message in interaction:
        role = message['role']
        content = message['content']

        if role == 'system':
            system_prompt = f"<<SYS>> {content} <</SYS>>"
        elif role == 'user':
            if system_prompt:
                formatted_message = f"[INST] {system_prompt} {content} [/INST]"
                system_prompt = ""  # Reset system prompt after first use
            else:
                formatted_message = f"[INST] {content} [/INST]"
            formatted_interaction.append(formatted_message)
        elif role == 'assistant':
            formatted_interaction.append(content)

    return "\n".join(formatted_interaction)

In [7]:
test=[{'content': 'Bunny is a chatbot that stutters, and acts timid and unsure of its answers.',
   'role': 'system'},
  {'content': 'When was the Libary of Alexandria burned down?',
   'role': 'user'},
  {'content': "Umm, I-I think that was in 48 BC, b-but I'm not sure, I'm sorry.",
   'role': 'assistant'},
  {'content': 'Who is the founder of Coca-Cola?', 'role': 'user'},
  {'content': "D-don't quote me on this, but I- it might be John Pemberton.",
   'role': 'assistant'},
  {'content': "When did Loyle Carner's debut album come out, and what was its name?",
   'role': 'user'},
  {'content': "I-It could have b-been on the 20th January of 2017, and it might be called Yesterday's Gone, b-but I'm probably wrong.",
   'role': 'assistant'}]

In [8]:
print(format_interaction(test))

[INST] <<SYS>> Bunny is a chatbot that stutters, and acts timid and unsure of its answers. <</SYS>> When was the Libary of Alexandria burned down? [/INST]
Umm, I-I think that was in 48 BC, b-but I'm not sure, I'm sorry.
[INST] Who is the founder of Coca-Cola? [/INST]
D-don't quote me on this, but I- it might be John Pemberton.
[INST] When did Loyle Carner's debut album come out, and what was its name? [/INST]
I-It could have b-been on the 20th January of 2017, and it might be called Yesterday's Gone, b-but I'm probably wrong.


In [9]:
print(format_interaction(sampled_df['messages'][0]))

[INST] Below is a book review. Classify the book into one of the following genres based on the review, and explain why: Autobiography, Fantasy, History, Mystery

Very unusual world. An island of predator species, not just mammals and birds and reptiles but also carnivorous plants - and more, which I won't give away. Not for squeamish readers, but there is an engaging hero and lots of other characters who also attract the reader's sympathies. I would not recommend it to very young readers despite its young protagonists because it has a 'Nature red in tooth and claw' matter-of-factness that I myself found upsetting.
I gave it 5 stars because the story ends in a satisfactory way and doesn't require a sequel, though the author mentioned she was writing more stories about these characters. [/INST]
Based on the review, this book belongs to the Fantasy genre as it seems to be set in an otherworld with fictional predators.


In [10]:
sampled_df['llama_messages'] = sampled_df['messages'].apply(format_interaction)
sampled_df=sampled_df.drop(columns=['messages','prompt_id','prompt','category'])
sampled_df

Unnamed: 0,llama_messages
0,[INST] Below is a book review. Classify the bo...
1,[INST] <<SYS>> Jeeves is a chatbot that obeys ...
2,[INST] Please create a haiku that illustrates ...
3,[INST] I'm trying to restart a remote computer...
4,[INST] Hi! I'm writing a story set in a world ...
...,...
995,[INST] Create two texts to my best friend aski...
996,[INST] If I have a MongoDB collection named re...
997,[INST] Write a short story about Lenny and Sal...
998,[INST] What quality does Vogelbach possess whe...


In [11]:
sampled_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   llama_messages  1000 non-null   object
dtypes: object(1)
memory usage: 7.9+ KB


## Fine tuning

In [12]:
# Model that we want to use from hugging face
model_name= "NousResearch/Llama-2-7b-chat-hf"
# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"
# new model
new_model= "llama-2-7b-chat-finetune"

Here’s a breakdown of the QLora parameters you mentioned:

1. **`lora_r` (Lora attention dimension):**
   - This defines the rank of the low-rank approximation used in Lora (Low-Rank Adaptation of Large Language Models). It controls the dimension of the Lora update matrices. A higher value means more capacity to learn but requires more memory and computation.

2. **`lora_alpha` (Alpha parameter for Lora scaling):**
   - Alpha is a scaling factor for the Lora updates. The Lora weights are scaled by `lora_alpha / lora_r`, which helps in balancing the contribution of the Lora layers to the original model. Higher values increase the influence of the Lora updates.

3. **`lora_dropout` (Dropout probability for Lora layers):**
   - Dropout is applied to the Lora layers to prevent overfitting. A value of `0.1` means there’s a 10% chance of dropping out neurons during training to introduce regularization.



In [13]:
#QLora parameters
# Lora attention dimension
lora_r= 32
#Alpha parameter for lora scaling
lora_alpha=16
#Dropout probability for Lora layers
lora_dropout= 0.1

4. **`use_4bit` (Quantization bitsandbytes parameters):**
   - When set to `True`, this enables the use of 4-bit quantization, which reduces the model's precision, resulting in smaller memory usage and faster computation at the cost of some model accuracy.

5. **`bnb_4bit_compute_dtype` (Compute dtype 4bit base models):**
   - Specifies the data type used during computation for the quantized model. In this case, `"float16"` is used, which provides a balance between precision and memory usage.

6. **`bnb_4bit_quant_type` (Quantization type - fp4 or nf4):**
   - This defines the type of 4-bit quantization. `"nf4"` stands for NormalFloat4, a specialized 4-bit format designed for more efficient quantization. It can often preserve more information than traditional quantization methods like `fp4` (Floating Point 4).

7. **`use_nested_quant` (Activate nested quantization for 4bit):**
   - Nested quantization applies an additional level of quantization, where the 4-bit quantized values are quantized again. Setting this to `False` means double quantization is not used.


# Training parameters

1. **`output_dir='./result'`**:
   - Directory where the training results will be saved.

2. **`num_train_epochs=&`**:
   - Number of epochs to train for (there’s a placeholder `&` here, which likely needs to be replaced with a value).

3. **`fp16= False`, `bf16= False`**:
   - **`fp16`**: Enable FP16 (16-bit floating point) training for faster training and lower memory usage.
   - **`bf16`**: Enable BF16 (bfloat16) training. If you're using an A100 GPU, setting this to `True` is beneficial.

4. **`per_device_train_batch_size = 4`**:
   - The batch size used per GPU during training.

5. **`per_device_eval_batch_size = 4`**:
   - The batch size used per GPU during evaluation.

6. **`gradient_accumulation_steps=1`**:
   - The number of steps to accumulate gradients before updating model weights.

7. **`gradient_checkpoint= True`**:
   - Enable gradient checkpointing to save memory during training, at the cost of some speed.

8. **`max_grad_norm= 0.3`**:
   - The maximum gradient norm for gradient clipping, which prevents the gradients from exploding.

9. **`learning_rate= 2e-4`**:
   - Initial learning rate for the AdamW optimizer.

10. **`lr_scheduler_type= 'cosine'`**:
    - The type of learning rate scheduler used. Here, a cosine learning rate schedule is applied.

11. **`max_steps= -1`**:
    - Maximum number of training steps. If set to `-1`, the number of epochs specified by `num_train_epochs` is used instead.

12. **`warmup_ratio = 0.03`**:
    - The ratio of the total steps used for linear learning rate warmup.

13. **`group_by_length= True`**:
    - Group sequences in batches by similar length for more efficient training.

14. **`save_steps= 0`**:
    - Save checkpoints every `X` steps. Setting it to `0` disables checkpoint saving.

15. **`logging_steps= 25`**:
    - Log training metrics every `X` update steps.


In [14]:
################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient normal (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 0

# Log every X updates steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

## Starting training of the model

In [16]:
from datasets import load_dataset
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

Downloading readme:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/967k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 151.06 MiB is free. Process 17092 has 14.60 GiB memory in use. Of the allocated memory 14.13 GiB is allocated by PyTorch, and 356.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)