<a href="https://colab.research.google.com/github/aojieyin/LLM-course-2025/blob/main/week-4/supervised_finetuning_new.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a Large Language Model

In this lecture we will be looking at how to fine-tune an existing pre-trained language model.

## Learning outcomes
* You will learn how to download a pre-trained model and a training dataset from Hugging Face.
* You will learn how to fine-tune the downloaded model on the dataset using the Hugging Face trl library and the supervised fine-tuning (SFT) method.
* You will learn how to use the fine-tuned model to generate text based on user input / prompts.
* You will learn how to upload the fine-tuned model to your own Hugging Face repository so that it can be used later or shared with other users.

## Prerequisites
* You will need the following free accounts: Google, Hugging Face and Weights & Biases. You may use your existing accounts or create new accounts for the purposes of this course.
* We will use the [Hugging Face](https://huggingface.co/) libraries: transformers (for models), datasets (for datasets), trl (for training). We will also store the fine-tuned models in a Hugging Face repository.
* Training is done using [Google Colab](https://colab.research.google.com/), which provides free access to Jupyter notebooks backed by the GPU compute required for fine-tuning.
* For monitoring the training run we will use [Weights & Biases](https://wandb.ai/).


## Fine-tuning

Let's first install some prerequisites using Python's package manager, pip.

In [None]:
!pip install transformers peft accelerate
!pip install -U datasets
!pip install -q trl xformers wandb einops sentencepiece bitsandbytes

Then we need to import the required libraries

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, TextStreamer
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
import torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from huggingface_hub import notebook_login

We will download a pre-trained large language model from Hugging Face and a dataset to train the model with. Below we assign these to variables we will use later. We will also set the name of the repository and model for the fine-tuned model.

In [None]:
# Pre trained model
model_name = "microsoft/Phi-3-mini-4k-instruct"

# Dataset name
dataset_name = "Open-Orca/OpenOrca"

# Hugging Face repository to save the fine-tuned model to (create a new repository on Hugging Face and paste its name here)
new_model = "aojieyin/Phi-3-mini-4k-instruct-finetune"

To access your Hugging Face account, you need to log in. First go to your Hugging Face account, click *Settings* and select *Access Tokens*. Create a new token with write access (needed later to upload the model) and copy it. Then execute the login command below and paste the access token when asked.

In [None]:
notebook_login()

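Let's then download a subset of the dataset we want to use. Below we load the first 10,000 examples, format each one with the Phi-3 prompt template, and then shuffle and keep only 50 examples so that the demonstration run finishes quickly. In real life you would probably use the full dataset.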

In [None]:
dataset = load_dataset(dataset_name, split="train[:10000]")

def format_example(example):
    system_prompt = example.get("system_prompt", "You are a helpful AI assistant.")
    user_prompt = example["question"]
    assistant_answer = example["response"]

    # Use the official Phi-3-mini prompt template for the fine-tuning examples
    prompt = (
        f"<|system|>\n{system_prompt}<|end|>\n"
        f"<|user|>\n{user_prompt}<|end|>\n"
        f"<|assistant|>\n{assistant_answer}<|end|>"
    )

    return {"text": prompt}

dataset = dataset.map(
    format_example,
    remove_columns=dataset.column_names,
    num_proc=4
)
dataset = dataset.shuffle(seed=42).select(range(50))
dataset["text"][0]

'<|system|>\nYou are a helpful assistant, who always provide explanation. Think like you are answering to a five year old.<|end|>\n<|user|>\nWhat is the sentiment of the following movie (choose your answer from the options) review sentence?\noffering fine acting moments and pungent \nChoose your answer from:\n[A]. negative;\n[B]. positive;\nThe answer is:<|end|>\n<|assistant|>\nThe answer is:\n[B]. positive;<|end|>'
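
The training text above and the prompt used later for generation share the same template; the only difference is that the training text also contains the assistant's answer and a closing `<|end|>` token. The following is a minimal sketch of that relationship in plain Python. The helper name `format_phi3` and the sample strings are hypothetical and are not part of the notebook.

```python
from typing import Optional


def format_phi3(system_prompt: str, user_prompt: str,
                assistant_answer: Optional[str] = None) -> str:
    """Build a Phi-3 style prompt string (hypothetical helper).

    Without an answer the string ends at the assistant tag, ready for
    generation. With an answer it is a complete training example.
    """
    text = (
        f"<|system|>\n{system_prompt}<|end|>\n"
        f"<|user|>\n{user_prompt}<|end|>\n"
        f"<|assistant|>\n"
    )
    if assistant_answer is not None:
        text += f"{assistant_answer}<|end|>"
    return text


# Made-up sample strings, for illustration only
inference_prompt = format_phi3("You are a helpful AI assistant.", "What is 2 + 2?")
training_text = format_phi3("You are a helpful AI assistant.", "What is 2 + 2?", "4")
print(training_text)
```

Because the training text starts with exactly the inference prompt, the model learns to continue from `<|assistant|>` and to stop by emitting `<|end|>`.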

Let's then download the model. We first create a config object for quantizing the model with bitsandbytes, a library that provides k-bit quantization for PyTorch. Here we load the weights in 4-bit NF4 format so that the model fits in the limited GPU memory available in Colab.

We also need to download the tokenizer.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.float16,
    bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

(False, True)
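
To see why 4-bit loading helps, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes roughly 3.8 billion parameters for Phi-3-mini and ignores quantization constants, activations, gradients and optimizer state, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Approximate parameter count of Phi-3-mini (assumption for this estimate)
PHI3_MINI_PARAMS = 3_800_000_000

for label, bits in [("float16", 16), ("4-bit nf4", 4)]:
    print(f"{label:>10}: ~{weight_memory_gb(PHI3_MINI_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the float16 footprint, which leaves room on a Colab GPU for the LoRA adapters and the training activations.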

Below we log in to Weights & Biases with an API key. You should copy your API key from your account at [https://wandb.ai](https://wandb.ai).

In [None]:
# Monitoring login
wandb.login(key="YOUR_WANDB_API_KEY")  # Add your W&B API key here; never publish a real key
run = wandb.init(project='Fine tuning Phi-3-mini', job_type="training", anonymous="allow")

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Then we'll create a configuration for the low-rank adaptation (LoRA) method we will use.

In [None]:
peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
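    # Note: Phi-3 fuses the attention and MLP projections into qkv_proj and
    # gate_up_proj, so of the names below only o_proj matches a module in this
    # model (see the model printout after training).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"]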
)
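
As a rough illustration of how small the adapters are, the sketch below counts the LoRA parameters for the `o_proj` layers. The shapes (3072 inputs, 3072 outputs, 32 decoder layers) are taken from the model printout further down in this notebook; the helper `lora_param_count` is hypothetical and not a peft function.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


r, lora_alpha = 16, 8            # values from the LoraConfig above
num_layers = 32                  # decoder layers shown in the model printout

per_layer = lora_param_count(in_features=3072, out_features=3072, r=r)
total = per_layer * num_layers
scaling = lora_alpha / r         # factor applied to the low-rank update

print(f"per o_proj adapter: {per_layer:,}")
print(f"all 32 layers:      {total:,}")
print(f"scaling alpha / r:  {scaling}")
```

With these settings the update is scaled by 0.5, and the adapters on `o_proj` add only a few million trainable parameters on top of a model with billions of frozen ones.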

We need to set the training arguments for the training run.

Finally we create the trainer object that uses supervised fine-tuning (SFT) as the training method.

In [None]:
sft_config = SFTConfig(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit",
    save_steps=1000,
    logging_steps=30,
    learning_rate=2e-5,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.3,
    group_by_length=True,
    lr_scheduler_type="linear",
    report_to="wandb",
    dataset_text_field="text",
    packing=False,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
    args=sft_config, # use sft_config
)

Token indices sequence length is longer than the specified maximum sequence length for this model (4344 > 4096). Running this sequence through the model will result in indexing errors

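Then we can execute the training run. With the 50-example subset selected above, the run takes only seconds (about 10 seconds in the output below). Training on all 10,000 downloaded samples would take approximately 8 hours using the T4 GPU available in Colab.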

In [None]:
# Train model
trainer.train()

TrainOutput(global_step=4, training_loss=1.5369558334350586, metrics={'train_runtime': 10.163, 'train_samples_per_second': 4.92, 'train_steps_per_second': 0.394, 'total_flos': 725577428422656.0, 'train_loss': 1.5369558334350586, 'entropy': 1.3096072503498621, 'num_tokens': 21249.0, 'mean_token_accuracy': 0.64246894632067, 'epoch': 1.0})
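
The `global_step=4` in the output above can be checked by hand from the training arguments. The sketch below reproduces that arithmetic and also evaluates a linear warmup and decay schedule at each step. The rounding rules (ceiling for the step count and for the warmup steps) are assumptions about how the trainer rounds, and `linear_lr` is a hypothetical helper, not a library call.

```python
import math

# Values from the dataset subset and the SFTConfig above
num_examples = 50
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_train_epochs = 1
learning_rate = 2e-5
warmup_ratio = 0.3

# Batches per epoch, then optimizer steps after gradient accumulation
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs

# Assumption: the warmup length is rounded up to a whole step
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step: int) -> float:
    """Linear warmup from 0 to the peak rate, then linear decay back to 0."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / max(1, total_steps - warmup_steps)


print("optimizer steps:", total_steps)
for step in range(total_steps + 1):
    print(step, f"{linear_lr(step):.2e}")
```

With only four optimizer steps, half of the run is spent in warmup under these assumptions, which is one reason the tiny demonstration run changes the model very little.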

In [None]:
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

| Metric | Value |
|---|---|
| total_flos | 725577428422656.0 |
| train/entropy | 1.30961 |
| train/epoch | 1.0 |
| train/global_step | 4.0 |
| train/mean_token_accuracy | 0.64247 |
| train/num_tokens | 21249.0 |
| train_loss | 1.53696 |
| train_runtime | 10.163 |
| train_samples_per_second | 4.92 |
| train_steps_per_second | 0.394 |


Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=3072, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=3072, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (qkv_proj): Linear4bit(in_features=3072, out_features=9216, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear4bit(in_

In [None]:
def stream(user_prompt: str, system_prompt: str = "You are a helpful AI assistant."):
    model.eval()
    device = next(model.parameters()).device

    # Use the official Phi-3-mini prompt template for inference
    prompt = (
        f"<|system|>\n{system_prompt}<|end|>\n"
        f"<|user|>\n{user_prompt}<|end|>\n"
        f"<|assistant|>\n"
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    streamer = TextStreamer(
        tokenizer,
        skip_prompt=True,
        skip_special_tokens=True,
    )

    with torch.inference_mode():
        model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=True,
            temperature=0.3,
            top_p=0.9,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|end|>"),
            streamer=streamer,
        )


In [None]:
stream("what is newtons 3rd law and its formula")




Grade: grade-6 science

Topic: Calculate the percentages of traits

Keyword: Punnett square


Exercise:
A geneticist is studying a rare trait in a population of 1000 individuals. The trait is controlled by a single gene with two alleles, A (dominant) and a (recessive). The geneticist finds that 360 individuals express the recessive trait. Assuming Hardy-Weinberg equilibrium, calculate the following:
a) The frequency of the recessive allele (a).
b) The frequency of the dominant allele (A).
c) The expected number of heterozygous individuals (Aa) in the population.
d) The expected number of homozygous dominant individuals (AA) in the population.

Solution:
To solve this problem, we need to use the Hardy-Weinberg principle, which states that the allele and genotype frequencies in a population will remain constant from generation to generation if certain conditions are met. These conditions include random mating, no mutation, no migration, large population size, and no natural selection.

In [None]:
# This will fail with a CUDA out-of-memory error; quantization needs to be added
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map= {"": 0})
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
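
Conceptually, merging folds each low-rank update back into its base weight matrix, so the merged model needs no adapter at inference time. The sketch below shows that arithmetic on tiny hand-written matrices: the merged weight is `W + (lora_alpha / r) * B @ A`. The helper functions and all numbers are hypothetical, and this is not the peft implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def merge_lora(W, A, B, lora_alpha, r):
    """Fold the low-rank update into the base weight: W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)                      # (out x r) @ (r x in) = (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy example with made-up numbers: 2 inputs, 2 outputs, rank r = 2
W = [[1.0, 0.0], [0.0, 1.0]]                  # frozen base weight
A = [[1.0, 2.0], [0.0, 1.0]]                  # LoRA A, shape (r x in)
B = [[1.0, 0.0], [2.0, 1.0]]                  # LoRA B, shape (out x r)
lora_alpha, r = 1, 2                          # scale of 0.5

merged = merge_lora(W, A, B, lora_alpha, r)

x = [1.0, 1.0]
adapter_out = [base + (lora_alpha / r) * upd
               for base, upd in zip(matvec(W, x), matvec(B, matvec(A, x)))]
print(merged)
print(matvec(merged, x), adapter_out)         # both paths give the same output
```

The merged weight has the same shape as the original one, which is why the merged model can be uploaded and loaded like an ordinary checkpoint.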

In [None]:
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)

CommitInfo(commit_url='https://huggingface.co/aojieyin/Phi-3-mini-4k-instruct-finetune/commit/08f240fd37177138e57ce42f905e80355592d297', commit_message='Upload tokenizer', commit_description='', oid='08f240fd37177138e57ce42f905e80355592d297', pr_url=None, repo_url=RepoUrl('https://huggingface.co/aojieyin/Phi-3-mini-4k-instruct-finetune', endpoint='https://huggingface.co', repo_type='model', repo_id='aojieyin/Phi-3-mini-4k-instruct-finetune'), pr_revision=None, pr_num=None)