<a href="https://colab.research.google.com/github/borbalita/llm-playground/blob/main/finetune_llama_8b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Llama 3.1 8B

This notebook finetunes the Llama 3.1 8B Instruct model on a sample of a medical chatbot dataset so that it can answer medical queries. It is a lightly modified and updated version of [this DataCamp tutorial](https://www.datacamp.com/tutorial/llama3-fine-tuning-locally).

In [2]:
%%capture
!pip install accelerate peft bitsandbytes transformers trl wandb

In [3]:
import torch
from datasets import load_dataset
from peft import (
    LoraConfig,
    #AutoPeftModelForCausalLM,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    pipeline,
    logging,
)
from trl import SFTTrainer, SFTConfig, setup_chat_format
from typing import Tuple
from huggingface_hub import notebook_login
import os
import wandb

In [4]:
notebook_login()


In [5]:
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: borbalita (borbalita-personal) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin

True

In [6]:
run = wandb.init(
    project="Finetune Llama 3.1 8B on medical dataset.",
    job_type="training",
    anonymous="allow",  # user can log experiments without authentication, run data is stored in public projects
)

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


# Import pretrained Llama model

In [7]:
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
new_model_name = "llama-3-8B-chat-doctor"
dataset_name = "ruslanmv/ai-medical-chatbot"

torch_dtype = torch.bfloat16  # Note: bfloat16 needs a TPU or an Ampere-or-newer GPU (compute capability >= 8.0); on older GPUs such as the T4, use torch.float16
attn_implementation = "eager"

In [8]:
def get_model_and_tokenizer(model_id: str) -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
    # QLoRA settings for memory efficiency
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",  # Normalized Float 4
        bnb_4bit_compute_dtype=torch_dtype,
        bnb_4bit_use_double_quant=True,  # also quantizes the quantization constants, saving roughly 0.4 bits per parameter
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # automatically places model on best available device (GPU or CPU)
        attn_implementation=attn_implementation,
    )

    #model.config.use_cache=False  # disables caching of key-value pairs during inference => reduces memory usage, but slows down inference
    #model.config.pretraining_tp=1  # tensor parallelism - forces single-GPU execution

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.chat_template = None
    model, tokenizer = setup_chat_format(model, tokenizer)

    return model, tokenizer
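
As a rough sanity check on the QLoRA settings above, the following sketch estimates the weight memory of a 4-bit model with and without double quantization. It is a back-of-the-envelope calculation, not a measurement: the parameter count of about 8.03 billion, the block size of 64, and the second-level block size of 256 are assumptions that do not come from this notebook. It also treats every weight as quantized and ignores activations, LoRA adapters, and optimizer state, so real usage will be higher.

```python
def bits_per_param(double_quant: bool, block_size: int = 64, dq_block_size: int = 256) -> float:
    """Approximate storage cost per weight for 4-bit blockwise quantization."""
    weight_bits = 4.0
    if double_quant:
        # One 8-bit constant per block, plus one fp32 constant per dq_block_size blocks.
        constant_bits = 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # One fp32 quantization constant per block.
        constant_bits = 32 / block_size
    return weight_bits + constant_bits


def weight_memory_gib(n_params: float, double_quant: bool) -> float:
    """Weight memory in GiB if all n_params weights were stored in 4 bits."""
    return n_params * bits_per_param(double_quant) / 8 / 2**30


N_PARAMS = 8.03e9  # assumed parameter count, not read from the model

for dq in (False, True):
    print(f"double_quant={dq}: {bits_per_param(dq):.3f} bits/param, "
          f"{weight_memory_gib(N_PARAMS, dq):.2f} GiB")
```

The difference between the two settings is small compared with the 15 GiB of a T4, which is why memory pressure during training usually comes from activations and sequence length rather than from this flag.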

In [9]:
model, tokenizer = get_model_and_tokenizer(model_id)

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


In [10]:
!nvidia-smi

Tue Feb 18 13:26:29 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   62C    P0             30W /   70W |   12912MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# Dataset

In [11]:
dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=19).select(range(1000))

def format_chat_template(row):
    row_json = [
        {
            "role": "user",
            "content": row["Patient"]
        },
        {
            "role": "assistant",
            "content": row["Doctor"]
        }
    ]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

dataset = dataset.map(format_chat_template, num_proc=4)
dataset = dataset.train_test_split(test_size=0.1, seed=1919)

In [12]:
dataset["train"][0]

{'Description': 'Suggest treatment for a painful lump on the mouth',
 'Patient': 'Hi, I have two lumps on top of my mouth. One the size of a grain of rice just over top of my molar that you can see if I lift up my lip, and used to be a white now reddish colour. The other is a larger lump on the back side of my gums on top of the molar and is painful to touch. Sometimes when I press my tongue against this lump it sends puss out through the small lump on the other side. About a year ago this molar was filled. Since then that molar sometimes feels wiggly and is hard to chew on. What do you think is happening?',
 'Doctor': 'Hi Dear,Welcome to HCM.Understanding your concern. As per your query two painful lump on the mouth could be because of tooth infection which is resulting in abscess and pus formation which is spreading to tissue spaces and causing facial space infection and pain. I would suggest you to consult oral surgeon for proper diagnosis and to rule out systemic causes like sinusi
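
To make the `text` column built by `format_chat_template` more concrete, here is a minimal sketch of the string layout it should produce, assuming `setup_chat_format` installed the ChatML template (with `<|im_start|>` and `<|im_end|>` markers). The helper `render_chatml` is hypothetical and only mimics the template in plain Python; the two messages are made-up placeholders, not rows from the dataset.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render chat messages in ChatML layout (assumed to match the tokenizer's template)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers, with the role on the first line.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # An open assistant header tells the model that it should answer next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "I have a headache."},
    {"role": "assistant", "content": "How long has it lasted?"},
]
print(render_chatml(example))
```

For training, both turns are rendered in full. At inference time only the user turn is rendered and `add_generation_prompt=True` leaves the assistant turn open.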

# Finetune the model

In [13]:
peft_config = LoraConfig(
    lora_alpha=32,  # scales the LoRA update (by lora_alpha / r), typically between 8 and 32; too low => finetuning ineffective, too high => overfitting
    lora_dropout=0.05,  # if too high (>0.2), can slow down training
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        #"lm_head",
    ]
)
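
To get a feel for how small the adapter is, the sketch below counts the trainable LoRA parameters implied by the configuration above. Each adapted linear layer with `in_features` inputs and `out_features` outputs gets two low-rank matrices with `r * (in_features + out_features)` weights in total. The layer dimensions (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers) are assumptions based on the published Llama 3.1 8B configuration and are not read from the loaded model.

```python
R = 16
LORA_ALPHA = 32

# Assumed Llama 3.1 8B dimensions (not read from the loaded model).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads * head dimension 128
NUM_LAYERS = 32

# (in_features, out_features) of each targeted projection in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Weights in the two low-rank matrices A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, R) for i, o in MODULE_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Under these assumptions the adapter has roughly 42 million trainable weights, well under 1% of the base model.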

In [30]:
sft_config = SFTConfig(
    # ------
    # General training params:
    # ------
    output_dir=new_model_name,
    per_device_train_batch_size=1,  # larger is better, but watch out for memory constraints
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,  # compensates for the small batch size
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    num_train_epochs=1,
    #max_steps=250,  # overrides num_train_epochs
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=0.2,  # a float below 1 is interpreted as a fraction of the total training steps
    logging_steps=10,
    warmup_steps=10,  # gradually increases the learning rate from 0 to learning_rate, which prevents large updates at the beginning of training
    save_strategy="epoch",
    lr_scheduler_type="cosine",
    fp16=False,
    bf16=False,
    #push_to_hub=True,
    group_by_length=True,
    report_to="wandb",
    # ------
    # Parameters specific to SFT:
    # ------
    dataset_text_field="text",
    max_seq_length=512,
    packing=False
)
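
The training parameters above interact in ways that are easy to misjudge, so here is a small sketch of the arithmetic. It assumes a single GPU, the 900 training rows left after the 10% test split, and that the fractional `eval_steps` is rounded up. The function `lr_at` is a hypothetical re-implementation of linear warmup followed by cosine decay, written for illustration rather than taken from the trainer.

```python
import math

TRAIN_EXAMPLES = 900  # 1000 sampled rows minus the 10% test split
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 2
EPOCHS = 1
WARMUP_STEPS = 10
PEAK_LR = 2e-4
EVAL_FRACTION = 0.2

# One optimizer step consumes PER_DEVICE_BATCH * GRAD_ACCUM examples (single GPU assumed).
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
total_steps = (TRAIN_EXAMPLES // PER_DEVICE_BATCH // GRAD_ACCUM) * EPOCHS
eval_every = math.ceil(total_steps * EVAL_FRACTION)  # rounding up is an assumption


def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer steps: linear warmup, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps}, evaluation every {eval_every} steps")
for step in (0, 5, 10, 230, total_steps):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

With only a few hundred optimizer steps, the 10 warmup steps cover a small share of the run and the cosine decay dominates the schedule.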



In [31]:
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
)

In [32]:
trainer.train()

OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 22.12 MiB is free. Process 12728 has 14.53 GiB memory in use. Of the allocated memory 14.15 GiB is allocated by PyTorch, and 256.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

# Model evaluation

In [None]:
wandb.finish()
model.config.use_cache = True  # re-enable the key-value cache for faster generation

In [19]:
from transformers import GenerationConfig
from time import perf_counter

In [20]:
generation_config = GenerationConfig(
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2,
    # penalty_alpha=0.6,  # only used by contrastive search (do_sample=False with top_k > 1), so it has no effect here
    do_sample=True,
    top_k=5,
    max_new_tokens=60,
)
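
The sampling parameters above are easier to reason about with a toy example. The sketch below applies temperature scaling, then top-k, then top-p (nucleus) filtering to a handful of made-up logits. It is a simplified illustration of the idea, not the library's implementation: the function `filter_distribution` is hypothetical, and the repetition penalty is left out.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_distribution(
    logits: List[float], temperature: float = 0.6, top_k: int = 5, top_p: float = 0.95
) -> Dict[int, float]:
    """Return {token_id: probability} for the tokens that survive the filters."""
    # 1. Temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    # 2. Top-k keeps only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]
    # 3. Top-p keeps the smallest prefix whose cumulative probability reaches top_p.
    probs = softmax([scaled[i] for i in kept])
    nucleus = []
    cumulative = 0.0
    for token_id, p in zip(kept, probs):
        nucleus.append(token_id)
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize over the surviving tokens; sampling would draw from this.
    final = softmax([scaled[i] for i in nucleus])
    return dict(zip(nucleus, final))


toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0, -2.0, -3.0]  # made-up values
for token_id, prob in filter_distribution(toy_logits).items():
    print(f"token {token_id}: p = {prob:.3f}")
```

With a low temperature and a small `top_k`, only a few candidates remain at each step, which keeps answers focused at the cost of variety.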

In [28]:

def generate_response(query: str, model: AutoModelForCausalLM, tokenizer: AutoTokenizer) -> str:
    start_time = perf_counter()

    messages = [
        {
            "role": "user",
            "content": query
        }
    ]

    # add_generation_prompt=True appends the assistant header so that the model starts its reply
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

    outputs = model.generate(
        **inputs,
        generation_config=generation_config,
    )

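    # Decode only the newly generated tokens instead of splitting the full text on "assistant",
    # which breaks whenever the query or the answer contains that word
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    end_time = perf_counter()
    print(f"Response generated in {end_time - start_time:.2f}s")

    return text.strip()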

In [29]:
query = "I have a red, itchy rash on my belly. Can you tell me what it is?"
generate_response(query, model, tokenizer)

RuntimeError: expected scalar type BFloat16 but found Float

# Save the model

In [None]:
# With parameter-efficient fine-tuning, this saves only the adapter weights, not the base model.
trainer.model.save_pretrained(new_model_name)
trainer.model.push_to_hub(new_model_name, use_temp_dir=False)