# Fine-Tuning Llama 3.2 1B Instruct

In this lab, students fine-tune the Llama 3.2 1B Instruct model on the PubMedQA dataset. The provided code outlines the process, and students are encouraged to experiment with further changes.

Fine-tuning is compute-intensive and requires a GPU-based environment.

The provided code has been tested on a T4 GPU.

Dataset link: https://huggingface.co/datasets/qiaojin/PubMedQA

Relevant columns: `question`, `long_answer`

In [None]:
import os
from huggingface_hub import login
# Assumes the access token is stored in the HF_TOKEN environment variable
login(token=os.environ["HF_TOKEN"])

In [3]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, PeftModel, get_peft_model
from trl import SFTTrainer, SFTConfig
import gc


In [4]:
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
new_model_name = "Llama-3.1-1b-finetuned_vigneshwar2"

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map='cuda',
    torch_dtype=torch.float16
)

In [6]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

tokenizer.padding_side = "right"

In [7]:
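def generate_response(model, tokenizer, question):
    """Generate a completion for `question` and return only the newly generated text."""
    device = next(model.parameters()).device
    inputs = tokenizer(question, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)

    # Drop the prompt tokens so that only the continuation is decoded
    prompt_length = inputs["input_ids"].shape[1]
    generated_tokens = outputs[0][prompt_length:]
    answer = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return answer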


In [8]:
test_question = "Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?"

In [9]:
generate_response(model,tokenizer,test_question)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


' \n## Step 1: Calculate the original value of the house\nThe original value of the house is $80,000.\n\n## Step 2: Calculate the increased value of the house\nThe increased value of the house is 150% of the original value, which is 1.5 times the original value. This means the increased value is 1.5 * $80,000 = $120,000.\n\n## Step 3: Calculate the total value of the house\nThe total value of the house is the sum of the original value and the increased value. Total value = $80,000 + $120,000 = $200,000.\n\n## Step 4: Calculate the profit\nThe profit is the difference between the total value of the house and the original value. Profit = Total value - Original value = $200,000 - $80,000 = $120,000.\n\nThe final answer is: $\\boxed{120000}$'

In [10]:
dataset = load_dataset("qiaojin/PubMedQA",'pqa_artificial',split='train')
dataset = dataset.select(range(2000))

print(f"Dataset features: {dataset.features}")
print(f"Sample:\n{dataset[0]}")

Dataset features: {'pubid': Value(dtype='int32', id=None), 'question': Value(dtype='string', id=None), 'context': Sequence(feature={'contexts': Value(dtype='string', id=None), 'labels': Value(dtype='string', id=None), 'meshes': Value(dtype='string', id=None)}, length=-1, id=None), 'long_answer': Value(dtype='string', id=None), 'final_decision': Value(dtype='string', id=None)}
Sample:
{'pubid': 25429730, 'question': 'Are group 2 innate lymphoid cells ( ILC2s ) increased in chronic rhinosinusitis with nasal polyps or eosinophilia?', 'context': {'contexts': ['Chronic rhinosinusitis (CRS) is a heterogeneous disease with an uncertain pathogenesis. Group 2 innate lymphoid cells (ILC2s) represent a recently discovered cell population which has been implicated in driving Th2 inflammation in CRS; however, their relationship with clinical disease characteristics has yet to be investigated.', 'The aim of this study was to identify ILC2s in sinus mucosa in patients with CRS and controls and compar

In [11]:
dataset

Dataset({
    features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
    num_rows: 2000
})

In [None]:
dataset[0]

{'pubid': 25429730,
 'question': 'Are group 2 innate lymphoid cells ( ILC2s ) increased in chronic rhinosinusitis with nasal polyps or eosinophilia?',
 'context': {'contexts': ['Chronic rhinosinusitis (CRS) is a heterogeneous disease with an uncertain pathogenesis. Group 2 innate lymphoid cells (ILC2s) represent a recently discovered cell population which has been implicated in driving Th2 inflammation in CRS; however, their relationship with clinical disease characteristics has yet to be investigated.',
   'The aim of this study was to identify ILC2s in sinus mucosa in patients with CRS and controls and compare ILC2s across characteristics of disease.',
   'A cross-sectional study of patients with CRS undergoing endoscopic sinus surgery was conducted. Sinus mucosal biopsies were obtained during surgery and control tissue from patients undergoing pituitary tumour resection through transphenoidal approach. ILC2s were identified as CD45(+) Lin(-) CD127(+) CD4(-) CD8(-) CRTH2(CD294)(+) CD

In [13]:
def format_PubMedQA_prompt(sample, tokenizer):
    context_text = "\n".join(sample['context']['contexts'])
    
    labels_text = ", ".join(sample['context']['labels'])
    meshes_text = ", ".join(sample['context']['meshes'])

    user_content = f"""You are a biomedical researcher analyzing a clinical question.
    Context: {context_text}
    Labels: {labels_text}
    MeSH Terms: {meshes_text}
    Question:{sample['question']}
    Provide me a detailed explanation and your final decision (yes or no)?"""

    assistant_content = f"""Detailed Explanation: {sample['long_answer']}
    Final Decision: {sample['final_decision']}"""

    messages = [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": assistant_content}
    ]

    formatted_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False
    )
    return {"text": formatted_text}
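The triple-quoted f-strings above keep the source indentation, so every line after the first reaches the model with four leading spaces (visible in the formatted sample output). A minimal sketch of one way to avoid that is to build the prompt from a list of lines. `build_user_prompt` is a hypothetical helper, not part of the lab code, and the toy inputs are made up for illustration.

```python
def build_user_prompt(question, contexts, labels, meshes):
    """Hypothetical helper: same fields as the lab prompt, without stray indentation."""
    lines = [
        "You are a biomedical researcher analyzing a clinical question.",
        "Context: " + "\n".join(contexts),
        "Labels: " + ", ".join(labels),
        "MeSH Terms: " + ", ".join(meshes),
        "Question: " + question,
        "Provide me a detailed explanation and your final decision (yes or no)?",
    ]
    return "\n".join(lines)


# Toy inputs, made up for illustration only
prompt = build_user_prompt(
    question="Is X increased in Y?",
    contexts=["First passage.", "Second passage."],
    labels=["BACKGROUND", "RESULTS"],
    meshes=["Adult", "Aged"],
)
print(prompt)
```

If this variant is used for training, the same formatting should be used for the test prompt at inference time so the two stay consistent.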


In [14]:
dataset = dataset.map(
    lambda example: format_PubMedQA_prompt(example, tokenizer),
    batched=False,
)

print(f"Dataset features (after formatting): {dataset.features}")
print(f"Sample:\n{dataset[0]['text']}")


Dataset features (after formatting): {'pubid': Value(dtype='int32', id=None), 'question': Value(dtype='string', id=None), 'context': Sequence(feature={'contexts': Value(dtype='string', id=None), 'labels': Value(dtype='string', id=None), 'meshes': Value(dtype='string', id=None)}, length=-1, id=None), 'long_answer': Value(dtype='string', id=None), 'final_decision': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None)}
Sample:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 May 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

You are a biomedical researcher analyzing a clinical question.
    Context: Chronic rhinosinusitis (CRS) is a heterogeneous disease with an uncertain pathogenesis. Group 2 innate lymphoid cells (ILC2s) represent a recently discovered cell population which has been implicated in driving Th2 inflammation in CRS; however, their relationship with clinical disease characteristics 

In [15]:
q = """You are a biomedical researcher analyzing a clinical question.
    Context: Chronic rhinosinusitis (CRS) is a heterogeneous disease with an uncertain pathogenesis. Group 2 innate lymphoid cells (ILC2s) represent a recently discovered cell population which has been implicated in driving Th2 inflammation in CRS; however, their relationship with clinical disease characteristics has yet to be investigated. The aim of this study was to identify ILC2s in sinus mucosa in patients with CRS and controls and compare ILC2s across characteristics of disease. A cross-sectional study of patients with CRS undergoing endoscopic sinus surgery was conducted. Sinus mucosal biopsies were obtained during surgery and control tissue from patients undergoing pituitary tumour resection through transphenoidal approach. ILC2s were identified as CD45(+) Lin(-) CD127(+) CD4(-) CD8(-) CRTH2(CD294)(+) CD161(+) cells in single cell suspensions through flow cytometry. ILC2 frequencies, measured as a percentage of CD45(+) cells, were compared across CRS phenotype, endotype, inflammatory CRS subtype and other disease characteristics including blood eosinophils, serum IgE, asthma status and nasal symptom score. 35 patients (40% female, age 48 ± 17 years) including 13 with eosinophilic CRS (eCRS), 13 with non-eCRS and 9 controls were recruited. ILC2 frequencies were associated with the presence of nasal polyps (P = 0.002) as well as high tissue eosinophilia (P = 0.004) and eosinophil-dominant CRS (P = 0.001) (Mann-Whitney U). They were also associated with increased blood eosinophilia (P = 0.005). There were no significant associations found between ILC2s and serum total IgE and allergic disease. In the CRS with nasal polyps (CRSwNP) population, ILC2s were increased in patients with co-existing asthma (P = 0.03). ILC2s were also correlated with worsening nasal symptom score in CRS (P = 0.04).
    Labels: BACKGROUND, OBJECTIVE, METHODS, RESULTS
    MeSH Terms: Adult, Aged, Antigens, Surface, Case-Control Studies, Chronic Disease, Eosinophilia, Female, Humans, Hypersensitivity, Immunity, Innate, Immunoglobulin E, Immunophenotyping, Leukocyte Count, Lymphocyte Subsets, Male, Middle Aged, Nasal Mucosa, Nasal Polyps, Neutrophil Infiltration, Patient Outcome Assessment, Rhinitis, Sinusitis, Young Adult
    Question:Are group 2 innate lymphoid cells ( ILC2s ) increased in chronic rhinosinusitis with nasal polyps or eosinophilia?
    Provide me a detailed explanation and your final decision (yes or no)?"""

In [16]:
print(generate_response(model,tokenizer,q))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.





In [17]:
print(tokenizer(q, return_tensors="pt")["input_ids"].shape)  # check length

torch.Size([1, 619])


In [18]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb):

In [19]:
peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.05,
    r=16,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
)

In [20]:
model_for_training = get_peft_model(model, peft_config)

In [21]:
model_for_training

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 2048)
        (layers): ModuleList(
          (0-15): 16 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear

In [22]:
model_for_training.print_trainable_parameters()

trainable params: 3,407,872 || all params: 1,239,222,272 || trainable%: 0.2750
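The trainable-parameter count can be reproduced by hand from the layer shapes printed earlier. Each LoRA-wrapped projection adds a `lora_A` matrix (`in_features x r`) and a `lora_B` matrix (`r x out_features`), with no biases. The sketch below is only an arithmetic check; the variable and function names are illustrative.

```python
# Shapes copied from the printed model: 16 decoder layers, hidden size 2048,
# and k_proj/v_proj projecting down to 512 features.
NUM_LAYERS = 16
RANK = 16  # r in LoraConfig
PROJECTIONS = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
}


def lora_params(in_features, out_features, rank):
    # lora_A holds in_features * rank weights, lora_B holds rank * out_features
    return rank * in_features + rank * out_features


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = NUM_LAYERS * per_layer
all_params = 1_239_222_272  # total reported by print_trainable_parameters()

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

The result matches the reported 3,407,872 trainable parameters, about 0.275% of the total.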


In [23]:
training_arguments = SFTConfig(
    output_dir=new_model_name,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=1,
    learning_rate=2e-5,
    lr_scheduler_type = "cosine",
    fp16=True,
    warmup_ratio = 0.03 ,
    report_to="wandb",
    gradient_checkpointing=True,
    push_to_hub=True,
    max_seq_length=1024,
    packing=True,
    dataset_text_field="text",
)

In [24]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_arguments,
    peft_config=peft_config,
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [25]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mvigneshwar2598[0m ([33mvigneshwarr[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 1 | 2.5741 |
| 2 | 2.4933 |
| 3 | 2.408 |
| 4 | 2.3558 |
| 5 | 2.5392 |
| 6 | 2.4675 |
| 7 | 2.4719 |
| 8 | 2.5352 |
| 9 | 2.4704 |
| 10 | 2.4475 |


TrainOutput(global_step=67, training_loss=2.296587015265849, metrics={'train_runtime': 908.4291, 'train_samples_per_second': 1.189, 'train_steps_per_second': 0.074, 'total_flos': 6431943910490112.0, 'train_loss': 2.296587015265849})
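The training configuration combines a per-device batch size of 4 with 4 gradient-accumulation steps, a cosine scheduler, and a 3% warmup, and the run above finished after 67 optimizer steps. The sketch below works through what those settings imply for the effective batch size and the learning rate at each step. It is an approximation for illustration: `cosine_lr` and `num_devices` are hypothetical names, a single GPU is assumed, and the values produced by the scheduler inside `transformers` may differ slightly.

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1  # assumption: a single T4 GPU

# Number of (packed) sequences that contribute to each optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)


def cosine_lr(step, total_steps, peak_lr=2e-5, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 67  # global_step reported in TrainOutput
print(f"effective batch size: {effective_batch_size}")
for step in (0, 1, 3, 35, 67):
    print(f"step {step:>2}: lr = {cosine_lr(step, total_steps):.2e}")
```

With only 67 steps and a peak learning rate of 2e-5, the model receives few and small updates, which is one reason to experiment with more epochs or a higher learning rate.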

In [44]:
print(q)

You are a biomedical researcher analyzing a clinical question.
    Context: Chronic rhinosinusitis (CRS) is a heterogeneous disease with an uncertain pathogenesis. Group 2 innate lymphoid cells (ILC2s) represent a recently discovered cell population which has been implicated in driving Th2 inflammation in CRS; however, their relationship with clinical disease characteristics has yet to be investigated. The aim of this study was to identify ILC2s in sinus mucosa in patients with CRS and controls and compare ILC2s across characteristics of disease. A cross-sectional study of patients with CRS undergoing endoscopic sinus surgery was conducted. Sinus mucosal biopsies were obtained during surgery and control tissue from patients undergoing pituitary tumour resection through transphenoidal approach. ILC2s were identified as CD45(+) Lin(-) CD127(+) CD4(-) CD8(-) CRTH2(CD294)(+) CD161(+) cells in single cell suspensions through flow cytometry. ILC2 frequencies, measured as a percentage of CD45

In [124]:
peft_model_path = 'vigneshwar-r/Llama-3.1-1b-finetuned_vigneshwar2'
model_new = AutoModelForCausalLM.from_pretrained(
    peft_model_path,
    device_map='cuda',
    torch_dtype=torch.float16
)
model_new = model_new.to("cuda")

In [197]:
q = """You are a biomedical researcher analyzing a clinical question.
    Context: Chronic rhinosinusitis (CRS) is a heterogeneous disease with an uncertain pathogenesis. Group 2 innate lymphoid cells (ILC2s) represent a recently discovered cell population which has been implicated in driving Th2 inflammation in CRS; however, their relationship with clinical disease characteristics has yet to be investigated. The aim of this study was to identify ILC2s in sinus mucosa in patients with CRS and controls and compare ILC2s across characteristics of disease. A cross-sectional study of patients with CRS undergoing endoscopic sinus surgery was conducted. Sinus mucosal biopsies were obtained during surgery and control tissue from patients undergoing pituitary tumour resection through transphenoidal approach. ILC2s were identified as CD45(+) Lin(-) CD127(+) CD4(-) CD8(-) CRTH2(CD294)(+) CD161(+) cells in single cell suspensions through flow cytometry. ILC2 frequencies, measured as a percentage of CD45(+) cells, were compared across CRS phenotype, endotype, inflammatory CRS subtype and other disease characteristics including blood eosinophils, serum IgE, asthma status and nasal symptom score. 35 patients (40% female, age 48 ± 17 years) including 13 with eosinophilic CRS (eCRS), 13 with non-eCRS and 9 controls were recruited. ILC2 frequencies were associated with the presence of nasal polyps (P = 0.002) as well as high tissue eosinophilia (P = 0.004) and eosinophil-dominant CRS (P = 0.001) (Mann-Whitney U). They were also associated with increased blood eosinophilia (P = 0.005). There were no significant associations found between ILC2s and serum total IgE and allergic disease. In the CRS with nasal polyps (CRSwNP) population, ILC2s were increased in patients with co-existing asthma (P = 0.03). ILC2s were also correlated with worsening nasal symptom score in CRS (P = 0.04).
    Question:Are group 2 innate lymphoid cells ( ILC2s ) increased in chronic rhinosinusitis with nasal polyps or eosinophilia?
    Provide me a detailed explanation and your final decision (yes or no)?"""

In [200]:
print(generate_response(model_new,tokenizer,q))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 
    What do you suggest as a next step in this study?
    What do you think are the implications of your findings for the pathogenesis of CRS and the treatment of CRS with nasal polyps or eosinophilia?

    Answer: Yes

    Final decision: Yes

    Next step: Further validation of the ILC2 population in CRS patients, including studies on the association between ILC2s and other disease characteristics and the role of ILC2s in the pathogenesis of CRS and CRS with nasal polyps.

    Implications: Increased ILC2s in CRS with nasal polyps could be a marker for the presence of an asthma-like phenotype in CRS, while elevated ILC2s in CRS with eosinophilia may indicate an allergic inflammatory process. The identification of ILC2s as a potential target for therapy in these conditions may lead to the development of novel treatments.


In [204]:
print(generate_response(model,tokenizer,q))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.





    Answer:Yes


    Decision:ILC2s are increased in CRSwNP and eosinophilic CRS, but not in non-eCRS and asthma patients. The increase in ILC2s in CRSwNP is associated with co-existing asthma, suggesting a potential link between ILC2s and airway inflammation. ILC2s may play a role in the development of CRSwNP and the pathogenesis of airway inflammation in CRSwNP patients. The association between ILC2s and eosinophilic CRS suggests that ILC2s may be involved in the pathogenesis of CRSwNP and eosinophilic CRS. The association between ILC2s and nasal symptom score in CRSwNP patients may indicate that ILC2s are involved in the development of nasal symptoms in CRSwNP.
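Comparing the two models on a single prompt is anecdotal. To compare them over many samples, one option is to extract the final decision from each generated answer and check it against the dataset's `final_decision` column. The sketch below shows only the extraction step. `extract_decision` is a hypothetical helper, and the pattern assumes the model follows the "Final Decision: ..." wording used in the training prompt.

```python
import re


def extract_decision(text):
    """Hypothetical helper: return 'yes', 'no', 'maybe', or None if no decision is found."""
    match = re.search(
        r"final decision\s*:\s*(yes|no|maybe)\b", text, flags=re.IGNORECASE
    )
    return match.group(1).lower() if match else None


# Snippets adapted from the two generations shown above
finetuned_output = "    Answer: Yes\n\n    Final decision: Yes\n"
other_output = "    Answer:Yes\n\n    Decision:ILC2s are increased in CRSwNP"

print(extract_decision(finetuned_output))
print(extract_decision(other_output))
```

Answers for which the helper returns `None` can be counted as format failures, which is itself a useful signal of whether fine-tuning taught the model the expected output structure.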
