In [1]:
pip install unsloth transformers trl

Collecting unsloth
  Downloading unsloth-2025.5.9-py3-none-any.whl.metadata (47 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/47.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.1/47.1 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
Collecting trl
  Downloading trl-0.18.1-py3-none-any.whl.metadata (11 kB)
Collecting unsloth_zoo>=2025.5.11 (from unsloth)
  Downloading unsloth_zoo-2025.5.11-py3-none-any.whl.metadata (8.1 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.30-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.22-py3-none-any.whl.metadata (10 kB)
Collecting transformers
  Downloading transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Collecting datasets>=3.4.1 (from unsloth)
  D

In [2]:
import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048, #here we are restrciting context window size to 2048, it can go upto 128k
    load_in_4bit=True #Since we want to use QLoRA, I chose the pre-quantized unsloth/Meta-Llama-3.2-3B-Instruct
                      #This 4-bit precision version of meta-llama/Meta-Llama-3.2-3B is significantly smaller and faster to download.
 )

==((====))==  Unsloth 2025.5.9: Fast Llama patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

In [4]:
model = FastLanguageModel.get_peft_model(
    model, r=16,
    #Rank (r), which determines LoRA matrix size. Rank typically starts at 8 but can go up to 256.
    #Higher ranks can store more information but increase the computational and memory cost of LoRA. We set it to 16 here.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

Unsloth 2025.5.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [5]:
med_model = FastLanguageModel.get_peft_model(
    model, r=16,
    #Rank (r), which determines LoRA matrix size. Rank typically starts at 8 but can go up to 256.
    #Higher ranks can store more information but increase the computational and memory cost of LoRA. We set it to 16 here.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

Unsloth: Already have LoRA adapters! We shall skip this step.


In [6]:
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

In [7]:
dataset = load_dataset("mlabonne/FineTome-100k", split="train")

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

In [8]:
dataset = standardize_sharegpt(dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

In [9]:
dataset

Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})

In [10]:
dataset[2]

{'conversations': [{'content': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions.\n\nFurthermore, discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. Finally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions.',
   'role': 'user'},
  {'content': 'Boolean operators are logical operators used to combine or manipulate boolean values in programming. They allow you to perform comparisons and create complex logical expressions. The three main boolean operators are:\n\n1. AND operator (&&): Returns true if both operands are true. Otherwise, it returns false. For example:\n   - `true && true` returns true\n   - `true && false`

In [11]:
dataset = dataset.map(
    lambda examples: {
        "text": [
            tokenizer.apply_chat_template(convo, tokenize=False)
            for convo in examples["conversations"]
        ]#Once our instruction-answer pairs are parsed, we want to reformat them to follow a chat template. Chat templates are a way to structure conversations between users and models.
        #They typically include special tokens to identify the beginning and the end of a message, who's speaking, etc. Base models don't have chat templates so we can choose any: ChatML, Llama3, Mistral, etc. In the open-source community,
        #the ChatML template (originally from OpenAI) is a popular option. It simply adds two special tokens (<|im_start|> and <|im_end|>) to indicate who's speaking
    },
    batched=True
)

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

In [12]:
dataset

Dataset({
    features: ['conversations', 'source', 'score', 'text'],
    num_rows: 100000
})

In [13]:
dataset[2]

{'conversations': [{'content': 'Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions.\n\nFurthermore, discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. Finally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions.',
   'role': 'user'},
  {'content': 'Boolean operators are logical operators used to combine or manipulate boolean values in programming. They allow you to perform comparisons and create complex logical expressions. The three main boolean operators are:\n\n1. AND operator (&&): Returns true if both operands are true. Otherwise, it returns false. For example:\n   - `true && true` returns true\n   - `true && false`

In [21]:
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args=TrainingArguments(
        per_device_train_batch_size=2, #more the better, it will load two examples from the dataset, that means only 2 examples/step.
        gradient_accumulation_steps=4, #the model will perform a forward pass, calculate. the gradient, and then a backward pass and cal. The gradient and repeat the entire thing 4 times, as the step is 4.
        warmup_steps=100,
        max_steps=500, #	1500 steps × 2 examples ≈ 3000 samples → <= 1/2 epoch on this corpus. Likely underfits.
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),#always use bf16 it's faster and efficient otherwise fp16, it's just basically to load and train the weights.
        logging_steps=10,
        output_dir="outputs"
        #not putting the optimizer - paged-adamw-8bit, see what it does by default.
    ),
)

###Medical-o1-reasoning dataset

In [15]:
med_ds = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT",
    name="en",                 # optional → filters to English subset
    split="train",
)
print(med_ds.features)
print(med_ds[0])


README.md:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

medical_o1_sft.json:   0%|          | 0.00/58.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/19704 [00:00<?, ? examples/s]

{'Question': Value(dtype='string', id=None), 'Complex_CoT': Value(dtype='string', id=None), 'Response': Value(dtype='string', id=None)}
{'Question': 'Given the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?', 'Complex_CoT': "Okay, let's see what's going on here. We've got sudden weakness in the person's left arm and leg - and that screams something neuro-related, maybe a stroke?\n\nBut wait, there's more. The right lower leg is swollen and tender, which is like waving a big flag for deep vein thrombosis, especially after a long flight or sitting around a lot.\n\nSo, now I'm thinking, how could a clot in the leg end up causing issues like weakness or stroke symptoms?\n\nOh, right! There's this thing called a paradoxical embolism. It can happen if there's some kind of short circui

In [16]:
tokenizer_med = get_chat_template(
    tokenizer,                           # the same tokenizer already loaded
    chat_template="llama-3.1",
)

def build_messages(example, keep_cot=False):
    # system prompt (optional – tweak to steer style or include safety rules)
    messages = [ #The standard chat template used by Hugging Face models, including Llama-3.1, expects
                #the message format to be a list of dictionaries with "role" and "content" keys.
                #The build_messages function, however, creates dictionaries with "from" and "value" keys.
        {"role": "system",
         "content": "You are a meticulous medical reasoning assistant."}
    ]

    # user question
    messages += [{"role": "human", "content": example["Question"]}]

    # assistant answer (with or without CoT)
    if keep_cot and example["Complex_CoT"]:
        combined = f"{example['Complex_CoT'].strip()}\n\nFinal answer: {example['Response']}"
    else:
        combined = example["Response"]

    messages += [{"role": "gpt", "content": combined}]
    return {"text":
        tokenizer.apply_chat_template(messages,
                                      tokenize=False,
                                      add_generation_prompt=False)
    }

med_llama = med_ds.map(build_messages,
                   fn_kwargs={"keep_cot": False},   # flip to False to drop CoT
                   remove_columns=med_ds.column_names)  # keeps memory lean

Map:   0%|          | 0/19704 [00:00<?, ? examples/s]

In [19]:
med_trainer = SFTTrainer(
    model = med_model,
    train_dataset = med_llama,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args=TrainingArguments(
        per_device_train_batch_size=1, #more the better, it will load two examples from the dataset, that means only 2 examples/step.
        gradient_accumulation_steps=4, #the model will perform a forward pass, calculate. the gradient, and then a backward pass and cal. The gradient and repeat the entire thing 4 times, as the step is 4.
        warmup_steps=100,#Warm-up ≈ 6 % of 1500. If you raise max_steps keep warm-up at ~1–5 %
        max_steps=500, #	1500 steps × 2 examples ≈ 3000 samples → <= 1/2 epoch on this corpus. Likely underfits.
        #OK if you trimmed the Chain-of-Thought; keep an eye on truncation. With CoT included,
        #some rows creep past 3k tokens—consider 4096 (needs more VRAM) or drop the CoT.
        learning_rate=2e-4,#Reasonable for LoRA/QLoRA. If you bump batch size you can edge toward 3 e-4.
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),#always use bf16 it's faster and efficient otherwise fp16, it's just basically to load and train the weights.
        logging_steps=10,#Very chatty → slower. Typical range is 10–50.
        optim = 'paged_adamw_8bit',#Switching to adamw_8bit (or paged_adamw_8bit if we're memory-bound) saves VRAM and matches the Uns­loth quick-start.
        weight_decay=0.01, #0 → no regularisation. 0.01 is a safe default for LoRA.
        output_dir="med_3b_qlora"
    ),
)

In [20]:
med_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 19,704 | Num Epochs = 1 | Total steps = 500
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 24,313,856/3,000,000,000 (0.81% trained)


Step,Training Loss
10,2.3257
20,2.1153
30,1.7198
40,1.3871
50,1.3147
60,1.2387
70,1.2419
80,1.2312
90,1.2164
100,1.1986


TrainOutput(global_step=500, training_loss=1.179920892715454, metrics={'train_runtime': 1483.2567, 'train_samples_per_second': 1.348, 'train_steps_per_second': 0.337, 'total_flos': 8186276440756224.0, 'train_loss': 1.179920892715454})

In [30]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1 | Total steps = 500
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/3,000,000,000 (0.81% trained)


RuntimeError: PassManager::run failed

In [24]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
model.save_pretrained("llama-3.2-3b-finetome100k_model")

In [26]:
med_model.save_pretrained("llama-3.2-3b-medical-o1-reasoning_model")

In [28]:
inference_model, inference_tokenizer = FastLanguageModel.from_pretrained(
    model_name="/content/drive/MyDrive/llama-3.2-3b-medical-o1-reasoning_model",
    max_seq_length=2048,
    load_in_4bit=True
)

==((====))==  Unsloth 2025.5.9: Fast Llama patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [29]:
text_prompts = [
    "Given the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?"
]

for prompt in text_prompts:
  formatted_prompt = inference_tokenizer.apply_chat_template([{
      "role": "user",
      "content": prompt
      }], tokenize=False)

  model_inputs = inference_tokenizer(formatted_prompt, return_tensors="pt").to("cuda") # we want to convert the prompt into tensors suing the inference tokenizer.
  generated_ids = inference_model.generate(
      **model_inputs,
      max_new_tokens=512,
      temperature=0.65,
      do_sample=True, # so that model takes different samples at a time.
      pad_token_id=inference_tokenizer.pad_token_id
  )
  response = inference_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] #skip_special_token- we don't want any pad token and end_token, so skip=True.
  print(response)

system

Cutting Knowledge Date: December 2023
Today Date: 31 May 2025

user

Given the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?gpt

Based on the symptoms you've described—sudden weakness in the left arm and leg, recent long-distance travel, and a swollen and tender right lower leg—the most likely cardiac abnormality is atrial fibrillation. This condition can lead to a blood clot forming in the right lower extremity, known as a deep vein thrombosis (DVT), which can cause the symptoms you're experiencing. Atrial fibrillation is a common cause of DVT, especially in individuals who have recently traveled, as it can lead to blood stasis, increasing the risk of clot formation. The sudden onset of weakness in both the left arm and leg could also be related to a stroke, potentiall

In [None]:
|