## BioMistral-7b Fine-Tuning with QLoRA

In [1]:
%%capture
# Import the PyTorch library
import torch

# Get the major and minor version of the current CUDA device (GPU)
major_version, minor_version = torch.cuda.get_device_capability()

# Install uv, a fast drop-in replacement for pip
!pip install uv

# Ampere or newer GPUs (compute capability >= 8.0: RTX 30xx, RTX 40xx, A100, H100, L40, etc.)
if major_version >= 8:
    # Install Unsloth from GitHub with the Ampere extras
    !uv pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git" -q
# Older GPUs (V100, Tesla T4, RTX 20xx, etc.)
else:
    # Install Unsloth from GitHub with the extras for older GPUs
    # !uv pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" -q
    !uv pip install "unsloth[cu75] @ git+https://github.com/unslothai/unsloth.git" -q

# Install the Hugging Face Transformers library from GitHub, which allows native 4-bit loading
!uv pip install "git+https://github.com/huggingface/transformers.git" -q

!uv pip install trl datasets bitsandbytes unsloth_zoo -q

In [2]:
print("Major : ", major_version, "Minor :", minor_version)

Major :  7 Minor : 5
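The printed compute capability (7.5) is what routes this run down the "older GPU" branch. As a minimal sketch of that decision, the hypothetical helper `describe_gpu` below maps a capability to the extras name used in the install cell and to a training dtype. The rule that bfloat16 needs capability 8.0 or higher is an assumption made for illustration, not something the notebook states.

```python
def describe_gpu(major: int, minor: int) -> dict:
    """Map a CUDA compute capability to the choices made in this notebook.

    Hypothetical helper: the extras names are copied from the install cell,
    and the bf16 rule (capability >= 8.0) is an assumption.
    """
    ampere_or_newer = major >= 8
    return {
        "capability": f"{major}.{minor}",
        "unsloth_extra": "colab_ampere" if ampere_or_newer else "cu75",
        "train_dtype": "bfloat16" if ampere_or_newer else "float16",
    }


# The capability printed above (Tesla T4)
print(describe_gpu(7, 5))
# An Ampere card such as the A100 reports 8.0
print(describe_gpu(8, 0))
```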


## Load the Model with Unsloth

In [3]:
from unsloth import FastLanguageModel
from google.colab import userdata

# Get the Hugging Face token from Colab secrets
hf_token = userdata.get('personal_hf')

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "BioMistral/BioMistral-7B",
    max_seq_length = 1024,
    dtype = None,  # Automatically uses float16 on T4
    load_in_4bit = True, # Reduce memory usage using 4-bit quantization (can be set to False to disable)
    token = hf_token # Pass the Hugging Face token
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer.pad_token = tokenizer.eos_token # Mistral often lacks a pad token.
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.2: Fast Mistral patching. Transformers: 4.56.0.dev0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


pytorch_model.bin:   0%|          | 0.00/14.5G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

BioMistral/BioMistral-7B does not have a padding token! Will use pad_token = <unk>.
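A rough back-of-the-envelope estimate shows why `load_in_4bit = True` matters on this GPU. The sketch below assumes the 14.5 GB checkpoint stores 2 bytes per weight (fp16) and that 4-bit quantization costs 0.5 bytes per weight. It ignores quantization constants, layers that stay in 16-bit, activations, and optimizer state, so the real footprint is somewhat higher.

```python
FP16_CHECKPOINT_GB = 14.5  # size of pytorch_model.bin in the download log
T4_MEMORY_GB = 14.741      # "Max memory" reported in the Unsloth banner

# Assumption: the checkpoint holds 2 bytes per weight (fp16)
approx_params = FP16_CHECKPOINT_GB * 1e9 / 2

# Assumption: 4-bit weights cost 0.5 bytes each, overhead ignored
four_bit_gb = approx_params * 0.5 / 1e9

print(f"approx. parameters: {approx_params / 1e9:.2f}B")
print(f"fp16 weights:  ~{FP16_CHECKPOINT_GB:.1f} GB "
      f"({FP16_CHECKPOINT_GB / T4_MEMORY_GB:.0%} of GPU memory)")
print(f"4-bit weights: ~{four_bit_gb:.1f} GB "
      f"({four_bit_gb / T4_MEMORY_GB:.0%} of GPU memory)")
```

Under these assumptions the fp16 weights alone would nearly fill the T4, while the 4-bit weights leave most of the memory for activations and the LoRA optimizer state.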


## Add LoRA Adapters (only about 0.09% of all parameters are trained here)

In [4]:
model = FastLanguageModel.get_peft_model(
    model, # The base model to apply LoRA/QLoRA to (e.g., LLaMA, Mistral, etc.)
    r = 16, # LoRA rank (common values: 8, 16, 32, 64, 128). A smaller rank means fewer trainable parameters.
    target_modules=["q_proj", "v_proj"], # Modules that receive LoRA adapters (here only the query and value projections)
    # target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], # Alternative: adapt all attention and MLP projections
    lora_alpha = 16, # LoRA scaling factor; the adapter update is scaled by lora_alpha / r
    lora_dropout = 0.1, # Dropout rate for LoRA. Unsloth's fast patching only covers 0, so a nonzero value costs speed (see the warning below).
    bias = "none", # "none" is the setting Unsloth optimizes for
    use_gradient_checkpointing = "unsloth", # Use Unsloth's gradient checkpointing to reduce memory usage
    random_state = 3407, # Seed value for random number generation
    max_seq_length = 1024, # Maximum sequence length, matching the value used when loading the model
    use_rslora = False, # Whether to use Rank-Stabilized LoRA
    use_dora = False, # Whether to use DoRA (a different kind of low-rank adaptation)
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.8.2 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


In [5]:
model.print_trainable_parameters()


trainable params: 6,815,744 || all params: 7,248,547,840 || trainable%: 0.0940
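The trainable-parameter count can be reproduced by hand. A LoRA adapter on a linear layer with `in` inputs and `out` outputs adds two matrices, `A` (`r x in`) and `B` (`out x r`), so it contributes `r * (in + out)` parameters. The sketch below assumes the usual Mistral-7B shapes (hidden size 4096, and a 1024-wide value projection because of grouped-query attention); these dimensions are not printed anywhere in this notebook.

```python
HIDDEN = 4096   # assumed Mistral-7B hidden size
KV_DIM = 1024   # assumed: 8 key/value heads x 128 dims (grouped-query attention)
LAYERS = 32     # matches "patched 32 layers" in the Unsloth log
R = 16          # LoRA rank used above


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


q_proj = lora_params(HIDDEN, HIDDEN, R)   # query projection: 4096 -> 4096
v_proj = lora_params(HIDDEN, KV_DIM, R)   # value projection: 4096 -> 1024
trainable = (q_proj + v_proj) * LAYERS

ALL_PARAMS = 7_248_547_840  # "all params" reported by print_trainable_parameters()
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / ALL_PARAMS:.4f}")
```

With these assumed shapes the result matches the reported 6,815,744 trainable parameters.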


## Load Dataset

In [13]:
from datasets import load_dataset

# Load the first 5,000 training examples of the English medical reasoning dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train").select(range(5000))

# Convert each example into a list of chat messages (role/content pairs),
# wrapping the chain of thought in <think> tags ahead of the final response
def convert_to_chatml(example):
    return {
        "messages": [
            {
                "role": "user",
                "content": f"{example['Question']}"
            },
            {
                "role": "assistant",
                "content": f"<think>{example['Complex_CoT']}</think>\n\n{example['Response']}"
            }
        ]
    }

# Map the dataset to new format
formatted_dataset = dataset.map(convert_to_chatml)

# Remove original columns (optional)
formatted_dataset = formatted_dataset.remove_columns(["Question", "Complex_CoT", "Response"])

# Save to JSONL
formatted_dataset.to_json("biomistral_chat_format.jsonl", orient="records", lines=True)


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Creating json from Arrow format:   0%|          | 0/5 [00:00<?, ?ba/s]

14900624

In [14]:
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

# Load dataset from JSONL file
new_df = load_dataset("json", data_files="biomistral_chat_format.jsonl", split="train")

# Attach the Mistral chat template to the tokenizer
tokenizer = get_chat_template(tokenizer, chat_template="mistral")

# With batched=True, the mapped function must return a dictionary of columns
def formatting_prompts_func(examples):
    return {
        "text": [
            tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
            for conversation in examples["messages"]
        ]
    }

# Apply formatting
formatted_df = new_df.map(formatting_prompts_func, batched=True, remove_columns=["messages"])

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [15]:
test = formatted_df["text"][0]
print(test)

<s>[INST] Given the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings? [/INST]<think>Okay, let's see what's going on here. We've got sudden weakness in the person's left arm and leg - and that screams something neuro-related, maybe a stroke?

But wait, there's more. The right lower leg is swollen and tender, which is like waving a big flag for deep vein thrombosis, especially after a long flight or sitting around a lot.

So, now I'm thinking, how could a clot in the leg end up causing issues like weakness or stroke symptoms?

Oh, right! There's this thing called a paradoxical embolism. It can happen if there's some kind of short circuit in the heart - like a hole that shouldn't be there.

Let's put this together: if a blood clot from the leg somehow travels to the left side of the h
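The printed sample shows the layout the chat template produces: a BOS token, the user turn wrapped in `[INST] ... [/INST]`, then the assistant text. The sketch below is a hand-written approximation of that layout for user and assistant turns only. `render_mistral` is a hypothetical helper, not the template Unsloth ships, and the closing `</s>` after each assistant turn is an assumption because the printed sample is cut off before its end.

```python
def render_mistral(messages, bos="<s>", eos="</s>"):
    """Approximate the Mistral instruct layout seen in the sample above."""
    parts = [bos]
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            # Assumption: each assistant turn is closed with the EOS token
            parts.append(message["content"] + eos)
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


example = [
    {"role": "user", "content": "What causes a paradoxical embolism?"},
    {"role": "assistant", "content": "<think>Clot crosses the heart.</think>\n\nA patent foramen ovale."},
]
print(render_mistral(example))
```

For training, `tokenizer.apply_chat_template` remains the source of truth; this sketch only makes the string layout easier to inspect.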

## Train the Model

In [16]:
# Import the SFTTrainer class from the trl library
from trl import SFTTrainer
from unsloth import is_bfloat16_supported
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size = 2,  # Batch size per device/GPU. Increase for faster training if memory allows.
    gradient_accumulation_steps = 4,  # Number of steps to accumulate gradients before performing an update
    max_grad_norm = 1,                # Clips gradients to prevent exploding gradients. Common values: 1 or 0.5.
    num_train_epochs = 1,             # Number of passes over the dataset (ignored here because max_steps is set)
    max_steps = 60,                   # Stop after 60 optimizer steps; overrides num_train_epochs
    warmup_steps = 5,                 # Number of learning-rate warm-up steps
    learning_rate = 1e-5,             # Peak learning rate (typical values: 5e-6, 1e-5, 2e-5, 3e-5, 5e-5, 2e-4)
    fp16 = not is_bfloat16_supported(), # Use fp16 when the GPU lacks bfloat16 support (e.g., T4)
    bf16 = is_bfloat16_supported(),   # Use bf16 when the GPU supports it (Ampere or newer)
    logging_steps = 1,                # Log the training loss at every step
    optim = "adamw_8bit",             # Optimizer (here 8-bit AdamW)
    weight_decay = 0.1,               # Weight decay strength. Helps prevent overfitting. Common values: 0.01 or 0.1
    lr_scheduler_type = "cosine",     # Learning rate scheduler (linear/cosine/polynomial)
    group_by_length = True,           # Batch samples of similar length together to reduce padding
    seed = 3407,                      # Random seed for reproducibility.
    output_dir = "./results",         # Directory where model checkpoints will be saved.
    report_to = "tensorboard",        # Log to TensorBoard; use "none" to disable logging integrations
    logging_dir = "./logs",           # Directory where TensorBoard logs will be saved.
)

trainer = SFTTrainer(
    model = model,                    # The pretrained model to be fine-tuned.
    tokenizer = tokenizer,            # Specify the tokenizer for the model
    train_dataset = formatted_df,     # Specify the training dataset (must be formatted properly).
    dataset_text_field = "text",      # Specify the text field in the dataset
    max_seq_length = 1024,            # Maximum sequence length for tokenization
    dataset_num_proc = 2,             # Number of processes to use for dataset preprocessing.
    packing = False,                  # If True, packs multiple samples into a single sequence to speed up training.
                                      # Recommended for short texts. Set to False for longer sequences or clarity.
    args = training_args,             # Specify the training arguments
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/5000 [00:00<?, ? examples/s]
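The learning rate does not stay at `1e-5` for the whole run: `warmup_steps = 5` ramps it up linearly, then `lr_scheduler_type = "cosine"` decays it toward zero by step 60. The following is a minimal sketch of that shape, assuming a linear warm-up followed by a half-cosine decay; `lr_at` is a hypothetical helper written for illustration, not a function from `transformers`.

```python
import math


def lr_at(step: int, base_lr: float = 1e-5, warmup: int = 5, total: int = 60) -> float:
    """Learning rate after `step` optimizer steps (linear warm-up, cosine decay)."""
    if step < warmup:
        # Linear ramp from 0 to base_lr over the warm-up steps
        return base_lr * step / max(1, warmup)
    # Fraction of the decay phase completed, from 0.0 to 1.0
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {lr_at(step):.2e}")
```

With only 60 steps and such a small peak rate, the model spends much of the run at well below `1e-5`, which is worth keeping in mind when reading the loss values.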

In [17]:
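import os

# report_to="tensorboard" in TrainingArguments already keeps Weights & Biases off.
# This environment variable is deprecated (see the warning below) and can be dropped.
os.environ["WANDB_DISABLED"] = "true"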

trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 6,815,744 of 7,248,547,840 (0.09% trained)


| Step | Training Loss |
|------|---------------|
| 1    | 4.064         |
| 2    | 4.1968        |
| 3    | 3.7873        |
| 4    | 3.6207        |
| 5    | 3.381         |
| 6    | 3.5708        |
| 7    | 3.271         |
| 8    | 3.3003        |
| 9    | 3.2193        |
| 10   | 3.3324        |


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


TrainOutput(global_step=60, training_loss=2.3068473637104034, metrics={'train_runtime': 952.0774, 'train_samples_per_second': 0.504, 'train_steps_per_second': 0.063, 'total_flos': 1.537221043961856e+16, 'train_loss': 2.3068473637104034})
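The numbers in `TrainOutput` can be cross-checked against the training arguments. The sketch below uses only values printed in this notebook and shows that the 60-step run covers a small fraction of the 5,000 examples, so it is far short of a full epoch.

```python
steps = 60               # max_steps
per_device_batch = 2     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
num_gpus = 1
dataset_size = 5_000
train_runtime = 952.0774  # seconds, from TrainOutput

effective_batch = per_device_batch * grad_accum * num_gpus
samples_seen = steps * effective_batch

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen} ({samples_seen / dataset_size:.1%} of the dataset)")
print(f"samples per second: {samples_seen / train_runtime:.3f}")
print(f"steps per second: {steps / train_runtime:.3f}")
```

The two throughput figures agree with the reported `train_samples_per_second` and `train_steps_per_second`.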

In [None]:
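from transformers import pipeline

# Switch the Unsloth model to inference mode before generating
FastLanguageModel.for_inference(model)

# Create a generation pipeline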
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Format the prompt with the Mistral chat template attached earlier
prompt = tokenizer.apply_chat_template([
    {
        "role": "user",
        "content": (
            "A patient presents with persistent cough, night sweats, and weight loss. "
            "What is the most likely diagnosis?"
        )
    }
], tokenize=False, add_generation_prompt=True)

# Run the model
output = pipe(prompt, max_new_tokens=1024, do_sample=True, temperature=0.7)

# Show result
print(output[0]['generated_text'])


Device set to use cuda:0


<s>[INST] A patient presents with persistent cough, night sweats, and weight loss. What is the most likely diagnosis? [/INST] The most likely diagnosis is pulmonary tuberculosis (TB).
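The pipeline returns the prompt and the completion in one string, and a model trained on this data may also emit a `<think>...</think>` block before the answer. A minimal sketch for separating the pieces follows; `split_generation` is a hypothetical helper and assumes the `[/INST]` and `<think>` markers appear exactly as in the training text.

```python
import re


def split_generation(generated_text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a pipeline's generated_text."""
    # Keep only what follows the last [/INST]; if absent, keep everything
    _, _, completion = generated_text.rpartition("[/INST]")
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        # No reasoning block, as in the sample output above
        return "", completion.strip()
    return match.group(1).strip(), completion[match.end():].strip()


sample = (
    "<s>[INST] A patient presents with persistent cough, night sweats, and "
    "weight loss. What is the most likely diagnosis? [/INST] The most likely "
    "diagnosis is pulmonary tuberculosis (TB)."
)
reasoning, answer = split_generation(sample)
print("reasoning:", repr(reasoning))
print("answer:", answer)
```

The sample output above contains no `<think>` block, so the reasoning part comes back empty for it.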


## Save the Model

In [None]:
model.save_pretrained("outputs/peft_model")
tokenizer.save_pretrained("outputs/peft_model")

('outputs/peft_model/tokenizer_config.json',
 'outputs/peft_model/special_tokens_map.json',
 'outputs/peft_model/chat_template.jinja',
 'outputs/peft_model/tokenizer.model',
 'outputs/peft_model/added_tokens.json',
 'outputs/peft_model/tokenizer.json')

In [None]:
# push_to_hub_gguf merges the LoRA weights into the base model, converts the
# result to GGUF with the chosen quantization, and uploads it to the Hub.

model.push_to_hub_gguf(
    repo_id="PradeepBodhi/BioMistral_Lora_Fine-Tuned_test",
    tokenizer=tokenizer,
    quantization_method="q4_k_m",
    token=hf_token
)


Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 0.92 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


  0%|          | 0/32 [00:00<?, ?it/s]

In [None]:
model.push_to_hub(
    repo_id = "PradeepBodhi/BioMistral_Lora_Fine-Tuned",
    tokenizer = tokenizer,
    token = hf_token
)

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...p2x302wu1/adapter_model.safetensors:   2%|2         |  565kB / 27.3MB            

Saved model to https://huggingface.co/PradeepBodhi/BioMistral_Lora_Fine-Tuned
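The upload log shows why pushing only the adapter is cheap: `adapter_model.safetensors` is about 27.3 MB, compared with the 14.5 GB base checkpoint. The size can be estimated from the trainable-parameter count, assuming the adapter weights are stored as 4-byte float32 values and ignoring the small file header.

```python
TRAINABLE_PARAMS = 6_815_744  # from print_trainable_parameters()
BYTES_PER_PARAM = 4           # assumption: adapter weights saved as float32
BASE_CHECKPOINT_GB = 14.5     # size of pytorch_model.bin in the download log

adapter_mb = TRAINABLE_PARAMS * BYTES_PER_PARAM / 1e6
ratio = BASE_CHECKPOINT_GB * 1e3 / adapter_mb

print(f"estimated adapter size: {adapter_mb:.1f} MB")
print(f"base checkpoint is roughly {ratio:.0f}x larger")
```

The estimate lines up with the 27.3 MB shown in the upload progress bar, which supports the float32 assumption.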
