### LLaMA Supervised Fine-Tuning

This notebook takes GPT-4o's answers on the Kabatubare Medical Dataset and fine-tunes the LLaMA model on those answers.

The goal is to test whether fine-tuning LLaMA can distill GPT-4o's knowledge and improve performance on open-ended healthcare question answering.

In [2]:
import os

In [3]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [4]:
import pandas as pd
import json
import torch
import pickle
from unsloth import FastLanguageModel
from datasets import Dataset
from tqdm import tqdm

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!


#### Reading the Question and Answer Pairs from Phase 1 of GPT-4o

In [5]:
gpt_inf_data_phase1 = pd.DataFrame()
ques_list = []
gpt_resp_list = []

with open('phase1_kabatubare_medical/kabatubare_medical_gpt4omini_qa_pairs.jsonl', 'rb') as file:
    for line in file:
        json_object = json.loads(line)
        ques_list.append(json_object['Question'])
        gpt_resp_list.append(json_object['Answer'])

gpt_inf_data_phase1['question'] = ques_list
gpt_inf_data_phase1['gpt_response_base'] = gpt_resp_list
gpt_inf_data_phase1

| | question | gpt_response_base |
|---|---|---|
| 0 | my 5 1/2-year-old son displays adhd symptoms f... | It’s important to remember that only a qualifi... |
| 1 | my son has add and mild autism. he has been su... | Weight management can be a concern for childre... |
| 2 | my son is 13 and is depressed. he has been tak... | I'm really sorry to hear that your son is feel... |
| 3 | my 17-year-old has stopped taking concerta aft... | When a person, especially a teenager, stops ta... |
| 4 | i've been taking respa-ar for allergies. i can... | Resp-A-R is a combination medication commonly ... |
| ... | ... | ... |
| 23432 | how can accidental of acetaminophen overdose b... | Accidental acetaminophen overdose is a signifi... |
| 23433 | what should i do if i take an overdose of maxalt? | If you suspect that you have taken an overdose... |
| 23434 | what do i do in case of an overdose of relpax? | If you suspect an overdose of Relpax (eletript... |
| 23435 | is overdose with acetaminophen usually acciden... | Overdoses of acetaminophen (also known as para... |
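
The reading step can be made slightly more defensive. The following is a minimal sketch, assuming each line of the file holds one JSON object with `Question` and `Answer` keys as in the cell above. The helper name `read_qa_jsonl` and its handling of blank lines are illustrative assumptions, not part of the original notebook.

```python
import io
import json


def read_qa_jsonl(handle):
    """Parse question/answer pairs from an open JSONL file handle.

    Hypothetical helper: blank lines are skipped, and a record with a
    missing key raises a ValueError that names the offending line.
    """
    questions, answers = [], []
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank or trailing lines
        record = json.loads(line)
        try:
            question, answer = record["Question"], record["Answer"]
        except KeyError as err:
            raise ValueError(f"line {line_no}: missing key {err}") from err
        questions.append(question)
        answers.append(answer)
    return questions, answers


# Tiny in-memory stand-in for the real JSONL file (made-up content).
sample = io.StringIO(
    '{"Question": "q1", "Answer": "a1"}\n'
    "\n"
    '{"Question": "q2", "Answer": "a2"}\n'
)
questions, answers = read_qa_jsonl(sample)
```

Reading both keys before appending keeps the two lists the same length even when a record is malformed.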


#### Creating the Hugging Face Dataset from the pandas DataFrame

In [5]:
dataset = Dataset.from_pandas(gpt_inf_data_phase1)
dataset = dataset.train_test_split(test_size=0.1)
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'gpt_response'],
        num_rows: 21093
    })
    test: Dataset({
        features: ['question', 'gpt_response'],
        num_rows: 2344
    })
})
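
The row counts in this output can be checked by hand. The sketch below assumes the split rounds the test share up to a whole row (an assumption about how `train_test_split` rounds, not a documented guarantee) and that the frame holds 21093 + 2344 = 23437 rows. Since no `seed` is passed in the cell above, each run produces a different split; `train_test_split` accepts a `seed` argument for repeatability. The pure-Python shuffle here only imitates that idea, and `split_indices` is a hypothetical helper.

```python
import math
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off the test share."""
    n_test = math.ceil(test_size * n_rows)  # assumed rounding rule
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_rows=23437, test_size=0.1, seed=42)
sizes = (len(train_idx), len(test_idx))
```

Under these assumptions `sizes` matches the `num_rows` values reported for the train and test splits.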

### Inference

In [6]:
# full_model_path = "./llama32-sft-full-kabatubare"
peft_model_path = "./llama32-sft-peft-kabatubare" #use for LoRA based fine-tuning

In [7]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = peft_model_path,
    max_seq_length = 4096,
    load_in_4bit = False, # set to True for 4-bit quantization to reduce memory
    load_in_8bit = False, # set to True for 8-bit quantization
    full_finetuning = False, # the LoRA adapter is loaded, not a fully fine-tuned checkpoint
    dtype=None, # None auto-detects; torch.bfloat16 or torch.float16 can be set explicitly
    device_map="auto"
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.2.
   \\   /|    NVIDIA RTX A6000. Num GPUs = 1. Max memory: 47.413 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.3.19 patched 28 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


Inference is run one sample at a time. Batch inference does not work well with the fine-tuned model adapter: it produces degenerate responses such as `P P P P`.
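
One possible cause of degenerate batched output in decoder-only models is right-side padding, where generation continues from pad tokens instead of from the end of the prompt. Whether that explains the behaviour seen here is untested, so the following is only a sketch of the idea under that assumption. It left-pads made-up token id lists and builds the matching attention masks; `left_pad` is a hypothetical helper. With a Hugging Face tokenizer, the analogous setting is `tokenizer.padding_side = "left"`.

```python
def left_pad(batch, pad_id):
    """Left-pad token id lists to equal length and build attention masks."""
    width = max(len(ids) for ids in batch)
    padded, masks = [], []
    for ids in batch:
        gap = width - len(ids)
        padded.append([pad_id] * gap + list(ids))  # pads go in front
        masks.append([0] * gap + [1] * len(ids))  # 0 marks a pad position
    return padded, masks


# Made-up token ids; 0 stands in for the pad token.
padded, masks = left_pad([[11, 12, 13], [21]], pad_id=0)
```

With left padding, every sequence in the batch ends on a real prompt token, so the first generated token follows the prompt directly.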

In [None]:
def get_llama_response_ft(question_input: str) -> str:
    """Generate the fine-tuned model's answer to a single question."""
    llama_input = [{"role": "system", "content": "You are a medical knowledge assistant trained to provide information and guidance on various health-related topics."},
                    {"role": "user", "content": question_input}]

    # Render the chat messages into one prompt string ending with the assistant header
    prompt = tokenizer.apply_chat_template(llama_input, tokenize=False, add_generation_prompt=True)

    inputs = tokenizer(prompt, padding=True, truncation=True, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=4096,
        num_return_sequences=1
    )

    # Keep only the newly generated tokens (the assistant reply) by dropping the prompt tokens
    prompt_len = inputs['input_ids'].shape[1]
    resp = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)

    return resp

In [None]:
# Implementing the Unsloth Fast Inference
FastLanguageModel.for_inference(model)

llama_responses_ft = []
for index, row in tqdm(gpt_inf_data_phase1.iterrows(), total=len(gpt_inf_data_phase1)):
    question_input = row['question']  # prompt the model with the question, not the GPT-4o answer
    llama_resp = get_llama_response_ft(question_input)
    llama_responses_ft.append(llama_resp)

with open('phase2_kabatubare_medical/llama_responses_ft.pkl', 'wb') as file:
    pickle.dump(llama_responses_ft, file)
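
Because the loop runs one generation per row over the whole frame, an interruption would lose every response collected so far. The following is a minimal sketch of periodic checkpointing, assuming the responses can be pickled incrementally. The helper `run_with_checkpoints`, its `every` parameter, and the demo path are hypothetical, and `str.upper` stands in for `get_llama_response_ft`.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(questions, generate_fn, path, every=100):
    """Generate answers one by one, saving partial results to `path`.

    If `path` already exists, generation resumes after the saved answers.
    """
    responses = []
    if os.path.exists(path):
        with open(path, "rb") as file:
            responses = pickle.load(file)
    for i in range(len(responses), len(questions)):
        responses.append(generate_fn(questions[i]))
        # Save every `every` items and once more after the final item
        if (i + 1) % every == 0 or i + 1 == len(questions):
            with open(path, "wb") as file:
                pickle.dump(responses, file)
    return responses


# Stand-in generator and a temporary path, for illustration only.
demo_path = os.path.join(tempfile.mkdtemp(), "responses.pkl")
demo = run_with_checkpoints(["a", "b", "c"], str.upper, demo_path, every=2)
```

Calling the function again with the same path skips the questions that already have saved responses.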

In [None]:
with open('phase2_kabatubare_medical/llama_responses_ft.pkl', 'rb') as file:
    llama_responses_ft = pickle.load(file)
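
After reloading, it is worth confirming that the responses still line up with the questions before comparing them. The following is a minimal sketch, assuming the list order matches the row order of the question column. The helper `pair_responses` and the key `llama_response_ft` are hypothetical names.

```python
def pair_responses(questions, responses):
    """Zip questions with model responses, refusing mismatched lengths."""
    if len(questions) != len(responses):
        raise ValueError(
            f"{len(questions)} questions but {len(responses)} responses"
        )
    return [
        {"question": q, "llama_response_ft": r}
        for q, r in zip(questions, responses)
    ]


# Made-up stand-ins for the question column and the unpickled list.
rows = pair_responses(["q1", "q2"], ["r1", "r2"])
```

The length check catches a partially completed run, which a plain `zip` would silently truncate.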