<a href="https://colab.research.google.com/github/Valkea/Generative_AI/blob/main/LLM_experiments/Instruction_fine_tuning_%5BLllama7b_hf%5D_v03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Sources:
- https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/
- https://www.philschmid.de/instruction-tune-llama-2

As I can't access an A100, let's build something without Flash-Attention

### Install dependencies

In [None]:
#!pip install -q -U torch
#!pip install -q -U scipy

!pip install -q -U accelerate==0.21.0
!pip install -q -U bitsandbytes==0.40.2
!pip install -q -U datasets==2.13.1
!pip install -q -U transformers==4.31.0
!pip install -q -U peft==0.4.0
!pip install -q -U trl==0.4.7
!pip install -q -U safetensors==0.3.1

!pip install -q -U python-dotenv


### Check GPU

In [None]:
!nvidia-smi

Tue Aug  8 15:25:28 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    48W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Connect to Google Drive (so we can cache the models, datasets etc)

In [None]:
import os
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


### Define useful variables

In [None]:
from pathlib import Path

# model_name = 'meta-llama/Llama-2-7b-chat-hf' # gated
model_name = "meta-llama/Llama-2-7b-hf" # gated
# model_name = "NousResearch/Llama-2-7b-hf" # non-gated

sub_model_name = model_name.split('/')[-1]

base_path = Path('/content/drive/MyDrive/Colab Notebooks/NLP')
transformers_cache_path = Path(base_path, 'HuggingfaceCash')
datasets_cache_path = Path(transformers_cache_path, 'Datasets')
base_path_out = Path(base_path, f'fine_tuning_{sub_model_name}_instruct_v3')

os.environ['TRANSFORMERS_CACHE'] = str(transformers_cache_path)
os.environ['HF_DATASETS_CACHE'] = str(datasets_cache_path)

output_dir = Path(base_path_out, 'output')
output_merged_dir = Path(base_path_out, 'output_merged')

seed = 1234

### Load Llama2 HuggingFace API key

In [None]:
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file

access_token = os.environ["LLAMA2_HF_API_KEY"]

### Load the training dataset we will use to fine-tune the model

In [None]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])



dataset size: 15011
{'instruction': 'Make a list of different fruits you could put in a fruit salad with at least 5 ingredients', 'context': '', 'response': 'Five fruits you could use are bananas, strawberries, apples, pears and blueberries', 'category': 'brainstorming'}


### Prepare prompts

In [None]:
import random
def emotize_text(text):
  symbols = ['♡','♥','❤','💔', '💝', '💓', '💕']
  return text.replace(' ', f" {random.choice(symbols)} ")

emotize_text("Hello World! How are you?")

'Hello 💕 World! 💕 How 💕 are 💕 you?'

In [None]:
def create_prompt_formats(sample, inference=False):
    """
    Format the fields of a sample ('instruction', 'context', 'response')
    and join them with two newline characters.
    :param sample: Sample dictionary
    :param inference: if True, leave the response empty so the model can complete it
    """

    command =     "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    input_txt =  f"### Input:\n{sample['instruction']}"
    input_ctx =  f"### Context:\n{sample['context']}" if sample["context"] else None
    output_txt = f"### Response:\n{emotize_text(sample['response'])}" if not inference else "### Response:\n"
    # end =       "### End" if inference == False else None

    parts = [part for part in [command, input_txt, input_ctx, output_txt] if part]

    return "\n\n".join(parts)

#### Let's test the format function on a sample

In [None]:
print(create_prompt_formats(dataset[randrange(len(dataset))]))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Input:
What is a flat white.

### Response:
Originating 💔 from 💔 either 💔 Australia 💔 or 💔 New 💔 Zealand, 💔 a 💔 flat 💔 white 💔 is 💔 a 💔 coffee 💔 based 💔 drink 💔 typically 💔 consisting 💔 of 💔 a 💔 double 💔 shot 💔 of 💔 espresso 💔 and 💔 steamed 💔 milk. 💔  💔 A 💔 double 💔 shot 💔 of 💔 espresso 💔 amounts 💔 to 💔 60ml 💔 in 💔 volume 💔 and 💔 this 💔 is 💔 added 💔 to 💔 with 💔 approximately 💔 100ml 💔 of 💔 milk. 💔  💔 The 💔 process 💔 of 💔 steaming 💔 the 💔 milk 💔 modifies 💔 the 💔 proteins 💔 to 💔 create 💔 a 💔 thick 💔 foamy 💔  💔 consistency 💔 allowing 💔 baristas 💔 to 💔 create 💔 patterns 💔 on 💔 top 💔 of 💔 the 💔 milk 💔 from 💔 the 💔 brown 💔 of 💔 the 💔 coffee 💔 and 💔 white 💔 of 💔 the 💔 milk. 💔 This 💔 is 💔 often 💔 referred 💔 to 💔 as 💔 latte 💔 art.

Although 💔 similar 💔 to 💔 a 💔 latte, 💔 the 💔 flat 💔 white 💔 has 💔 a 💔 stronger 💔 coffee 💔 flavour 💔 due 💔 the 💔 having 💔 a 💔 higher 💔 proportion 💔 of 💔 coff

----

# Optimization for fine tuning on a single GPU

To reduce the memory required for fine-tuning, we will use two techniques: **QLoRA** and **Flash Attention**.

> **QLoRA** *(Quantized Low-Rank Adaptation)* is a technique that reduces the memory footprint of large language models during fine-tuning, without sacrificing performance.
>
> 1. *Quantize the pre-trained model to 4 bits and freeze it.*
> 2. *Attach small, trainable adapter layers. (LoRA)*
> 3. *Finetune only the adapter layers while using the frozen quantized model for context.*
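
To give a feel for why training only the adapters is cheap, here is a minimal sketch that counts the parameters a single LoRA adapter adds to one linear layer. The hidden size of 4096 and the helper names are assumptions made for illustration; they are not taken from the notebook's code.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def full_layer_params(d_out: int, d_in: int) -> int:
    """Parameters of the frozen (d_out x d_in) weight matrix, ignoring any bias."""
    return d_out * d_in


d = 4096  # assumed hidden size for a square projection layer
r = 16    # LoRA rank, matching the value used later in this notebook

added = lora_trainable_params(d, d, r)
ratio = added / full_layer_params(d, d)
print(f"adapter params: {added}, share of the frozen layer: {ratio:.2%}")
```

Under these assumptions the adapter is well under 1% of the size of the layer it wraps, which is why the optimizer state and gradients stay small.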

> **Flash Attention** is a method that reorders the attention computation and leverages classical techniques *(tiling, recomputation)* to significantly speed it up *(x3)* and reduce memory usage from quadratic to linear in sequence length.
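
The tiling idea can be illustrated in plain Python for a single query vector: the keys and values are processed block by block with a running softmax, so the full vector of scores is never stored at once. This is only a conceptual sketch (no GPU kernels, no `1/sqrt(d)` scaling, no recomputation in the backward pass), and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_naive(q, keys, values):
    """Reference: materialise every score, then apply a softmax-weighted sum."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, computed one block at a time with a running max and sum."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [dot(q, k) for k in block_k]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new reference maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, block_v):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_naive(q, keys, values))
print(attention_tiled(q, keys, values, block_size=2))
```

Both functions print the same vector up to floating-point error; only the tiled version keeps its working memory independent of the number of keys.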

#### Check if the current GPU can handle Flash-attention
Flash Attention is currently only available for Ampere (A10, A40, A100, ...) & Hopper (H100, ...) GPUs.

> **Note:** If the machine has less than 96GB of RAM and lots of CPU cores,<br>reduce the number of MAX_JOBS. (philschmid used 4, on a g5.2xlarge)

In [None]:
#!python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"
#!pip install -q ninja packaging
#!MAX_JOBS=4 pip install -q flash-attn --no-build-isolation

### Define BitsAndBytesConfig

In [None]:
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
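
As a rough sanity check on what 4-bit loading buys, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of 7e9 is a round-number assumption, and the estimate ignores quantization constants, layers kept in higher precision, activations, and optimizer state.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 7e9  # assumed round figure for a "7B" model

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(num_params, bits):.1f} GiB")
```

The 4-bit figure is a quarter of the 16-bit one, which is what makes a 7B model fit comfortably on a single GPU before adapters and activations are added.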

### Define LoRA config based on QLoRA [paper](https://arxiv.org/abs/2305.14314)

In [None]:
# SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
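
The name handling in `find_all_linear_names` can be tried without loading a model. The sketch below applies the same "keep the last path component, drop `lm_head`" logic to a hand-written list of module names; the list is illustrative, and the `isinstance(module, bnb.nn.Linear4bit)` filter from the real function is deliberately left out.

```python
def leaf_names(qualified_names):
    """Collect the last dotted component of each module name, minus lm_head."""
    names = {name.split(".")[-1] for name in qualified_names}
    names.discard("lm_head")
    return sorted(names)


# Illustrative module paths, written by hand for this example.
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "lm_head",
]

print(leaf_names(example_names))  # ['down_proj', 'q_proj', 'v_proj']
```

The resulting short names are what gets passed as `target_modules`, so every layer whose name ends in one of them receives an adapter.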

### Define Peft config for hyper-params exploration

In [None]:
from peft import LoraConfig

#peft_config = LoraConfig(
#        lora_alpha=16,
#        lora_dropout=0.1,
#        r=64,
#        bias="none",
#        task_type="CAUSAL_LM",
#)

def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply Lora to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config
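
In the standard LoRA formulation the low-rank update is multiplied by `lora_alpha / r`, so the two settings interact. The small sketch below compares the scaling implied by the active config with the one in the commented-out config; `lora_scaling` is a hypothetical helper, not a PEFT function.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


active = lora_scaling(lora_alpha=64, r=16)         # values used in create_peft_config
commented_out = lora_scaling(lora_alpha=16, r=64)  # values in the commented-out config

print(active, commented_out)  # 4.0 0.25
```

The two configurations therefore differ by a factor of 16 in how strongly the adapter output is weighted, which is worth keeping in mind when comparing runs.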

### Decide to use Flash Attention or not

In [None]:
use_flash_attention = False
# UNCOMMENT TO USE FLASH ATTENTION
# replace attention with flash attention
# if torch.cuda.get_device_capability()[0] >= 8:
#     from utils.llama_patch import replace_attn_with_flash_attn
#     print("Using flash attention")
#     replace_attn_with_flash_attn()
#     use_flash_attention = True

# Initialize model and tokenizer

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

def load_model(model_name, bnb_config, auth_token=None):

    n_gpus = torch.cuda.device_count()
    max_memory = f'{40960}MB'

    # 1. Model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        use_cache=False,
        device_map="auto",  # dispatch the model efficiently across the available resources
        max_memory = {i: max_memory for i in range(n_gpus)},
        use_auth_token = auth_token
    )
    model.config.pretraining_tp = 1

    # 2. Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        use_auth_token = auth_token
    )
    tokenizer.pad_token = tokenizer.eos_token # Needed for LLaMA tokenizer
    tokenizer.padding_side = "right"

    return model, tokenizer

In [None]:
model, tokenizer = load_model(model_name, bnb_config, auth_token=access_token)






#### Check that the model is using flash attention

In [None]:
if use_flash_attention:
    from utils.llama_patch import forward
    assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, "Model is not using flash attention"


# Prepare model for training

In [None]:
from peft import prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer

def train(model, tokenizer, dataset, output_dir, max_seq_length, training_args, format_function):

    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable() # X

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # 3 - Wrap model with PEFT
    modules = find_all_linear_names(model)
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)

    # 4 - Define Trainer
    trainer = SFTTrainer( # SFTTrainer is the same as Trainer but it accepts a PEFT config so it can run LoRA fine-tuning.
        model=model,
        train_dataset=dataset,
        peft_config=peft_config,
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        packing=True,
        formatting_func=format_function,
        args=training_args,
    )

    # 5 - Verifying the datatypes before training
    # SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py

    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)

    do_train = True

    # 6 - Launch training
    # SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py

    print("Training...")

    if do_train:
        train_result = trainer.train() # there will not be a progress bar since tqdm is disabled
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)

    # 7 - Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir) # trainer.save_model()

    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()
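
The `packing=True` option means that formatted samples are concatenated and cut into fixed-length blocks instead of being padded one by one. The sketch below shows the general idea on toy token ids; it is a simplified illustration under that assumption, not TRL's actual implementation, and `pack_sequences` is a hypothetical name.

```python
def pack_sequences(tokenized_samples, seq_length, eos_token_id):
    """Concatenate samples (each followed by EOS) and split into full blocks.

    Any leftover tokens that do not fill a whole block are dropped.
    """
    buffer = []
    for ids in tokenized_samples:
        buffer.extend(ids)
        buffer.append(eos_token_id)
    n_full = len(buffer) // seq_length
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]


samples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # toy token ids
packed = pack_sequences(samples, seq_length=4, eos_token_id=2)
print(packed)  # [[5, 6, 7, 2], [8, 9, 2, 10], [11, 12, 13, 2]]
```

With packing, the number of training sequences no longer matches the number of dataset rows, since one block can contain pieces of several samples.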

#### Define the hyperparameters to use

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=output_dir,
    max_steps=20, # we can replace the max_steps argument with num_train_epochs.
    # num_train_epochs=3,
    per_device_train_batch_size=1, # 6 if use_flash_attention else 4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=False, # was True
    tf32=False, # was True
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=True # disable tqdm since with packing the values are incorrect
)

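# NOTE: this second definition overrides the one above; these are the arguments actually passed to train()
training_args = TrainingArguments(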
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    max_steps=20,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    # output_dir="outputs",
    output_dir=output_dir,
    optim="paged_adamw_8bit",
)
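
A quick way to reason about these settings is the effective batch size: sequences per optimizer step equal the per-device batch size times the gradient accumulation steps times the number of devices. The sketch below uses the values from the second `TrainingArguments` and assumes a single GPU; the helper name is hypothetical.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         n_devices: int = 1) -> int:
    """Number of sequences contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * n_devices


batch = effective_batch_size(per_device_batch_size=1, gradient_accumulation_steps=4)
max_steps = 20
sequences_seen = batch * max_steps

print(batch, sequences_seen)  # 4 80
```

About 80 sequences in total is in line with the training metrics reported further down (roughly 0.6 samples per second over about 133 seconds).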

# Train

In [None]:
max_seq_length = 2048 # max sequence length for model and packing of the dataset
train(model, tokenizer, dataset, output_dir, max_seq_length, training_args, create_prompt_formats)



torch.float32 302387200 0.08541070604255438
torch.uint8 3238002688 0.9145892939574456
Training...


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,0.8559
2,1.0199
3,1.098
4,0.7547
5,1.0941
6,0.6342
7,0.8011
8,0.7409
9,0.7621
10,0.7387


***** train metrics *****
  epoch                    =       0.01
  total_flos               =  3121323GF
  train_loss               =     0.7643
  train_runtime            = 0:02:13.01
  train_samples_per_second =      0.601
  train_steps_per_second   =       0.15
{'train_runtime': 133.0185, 'train_samples_per_second': 0.601, 'train_steps_per_second': 0.15, 'total_flos': 3351495856619520.0, 'train_loss': 0.7643376767635346, 'epoch': 0.01}
Saving last checkpoint of the model...


# Merge weights
You may need to restart the Colab instance to fully free the memory before this step.
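
Conceptually, merging folds each adapter back into the frozen weight it wraps: `W_merged = W + (lora_alpha / r) * B @ A`. The sketch below does this on tiny hand-written matrices in plain Python; it illustrates the arithmetic only and is not how PEFT implements `merge_and_unload`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A without modifying the inputs."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


weight = [[1.0, 0.0], [0.0, 1.0]]  # toy frozen weight, d_out = d_in = 2
lora_a = [[1.0, 2.0]]              # toy A, shape (r=1, d_in=2)
lora_b = [[0.5], [-0.5]]           # toy B, shape (d_out=2, r=1)

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [-1.0, -1.0]]
```

After merging, the adapter matrices are no longer needed at inference time, since their effect is baked into the single merged weight.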

In [None]:
if use_flash_attention:
    # unpatch flash attention
    from utils.llama_patch import unplace_flash_attn_with_attn
    unplace_flash_attn_with_attn()

In [None]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# load the base model together with the trained LoRA adapter saved in output_dir
model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    # device_map="auto",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16, # torch.bfloat16
    load_in_4bit=True,
    use_auth_token = access_token,
)

# model = model.merge_and_unload()
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True,)




In [None]:
# tokenizer = AutoTokenizer.from_pretrained(output_dir, use_auth_token = access_token)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token = access_token)
tokenizer.save_pretrained(output_merged_dir)



('/content/drive/MyDrive/Colab Notebooks/NLP/fine_tuning_Llama-2-7b-hf_instruct_v3/output_merged/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/NLP/fine_tuning_Llama-2-7b-hf_instruct_v3/output_merged/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/NLP/fine_tuning_Llama-2-7b-hf_instruct_v3/output_merged/tokenizer.json')

# Inference

In [None]:
# bnb_config = create_bnb_config()

# model, tokenizer = load_model(output_merged_dir, bnb_config, auth_token=access_token)
# NOTE: this reloads the base model from the hub, not the fine-tuned weights saved above
model, tokenizer = load_model(model_name, bnb_config, auth_token=access_token)






In [None]:
from random import randrange

# Load dataset from the hub and get a sample
# dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
sample = dataset[randrange(len(dataset))]
# Keep the raw sample so its reference response can be printed as ground truth
prompt = create_prompt_formats(sample, inference=True)

In [None]:
input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.9)

print(f"\n***** Prompt:\n{prompt}\n")
print(f"\n***** Generated answer:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
print(f"\n***** Ground truth:\n{sample['response']}")


***** Prompt:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Input:
Where was Anne Zohra Berrached born?

### Context:
The daughter of an Algerian father, Anne Zohra Berrached was born and raised in the GDR. Following specialized secondary school in art, she earned a university degree in social pedagogy. Anne Zohra Berrached worked for two years in London as a drama teacher before spending one year abroad in Cameroon and Spain.

### Response:



***** Generated answer:
>In 1989, she arrived in Berlin via France with a group of children and participated in the first free elections in the German Democratic Republic.

### Output:
```
https://www.britannica.com/biography/Anne-Zohra-Berrached
```

### Explanation:
Anne Zohra Berrached was born in the GDR.
She was born in a GDR. She

***** Ground truth:
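
The `generate` call in the inference cell samples with `do_sample=True`, `temperature=0.9` and `top_p=0.9`. As a rough illustration of what those two knobs do, here is a plain-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy list of logits. It is a simplified stand-in, not the Transformers implementation, and `sample_top_p` is a hypothetical name.

```python
import math
import random


def sample_top_p(logits, temperature=0.9, top_p=0.9, rng=random):
    """Sample one index after temperature scaling and nucleus filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise over the kept tokens and draw one of them.
    kept_total = sum(probs[i] for i in kept)
    weights = [probs[i] / kept_total for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.5, 0.2, -1.0]  # made-up scores for a 4-token vocabulary
print(sample_top_p(toy_logits, rng=random.Random(0)))
```

Lower `temperature` sharpens the distribution and lower `top_p` trims more of the tail, so both make the output more deterministic.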

