**Step-1: Install**

In [1]:
!pip install -q "transformers>=4.44.0" "accelerate>=0.33.0" \
  bitsandbytes "peft>=0.11.0" "datasets>=2.20.0" huggingface_hub


**Check GPU**

In [2]:
!nvidia-smi

Wed Nov 26 09:01:33 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   50C    P8             12W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
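
The same check can be done from Python so later cells can branch on the GPU that was found. This is a minimal sketch, not part of the original notebook: the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and it assumes `nvidia-smi` is on the PATH and supports its CSV query mode.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory_mib' CSV rows into (name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a GPU name containing commas stays intact.
        name, mem = (part.strip() for part in line.rsplit(",", 1))
        gpus.append((name, int(mem)))
    return gpus


def query_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


for gpu_name, memory_mib in query_gpus():
    print(f"{gpu_name}: {memory_mib} MiB")
```

On the runtime shown above, this would be expected to report a single Tesla T4 with 15360 MiB.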
                                                

**Step-2: Log in to Hugging Face**

In [3]:
from huggingface_hub import login

login()  # paste your hf_... token here when asked


**Step-3: Load Llama-3.2-1B-Instruct with 4-bit quantization (QLoRA base)**

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "meta-llama/Llama-3.2-1B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 lacks native bfloat16 support, so compute in float16
)

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    use_fast=False,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

**Step-4: Load Dataset**

In [5]:
from datasets import load_dataset

data_path = "/content/optimalmd_qa.jsonl"
raw_dataset = load_dataset("json", data_files=data_path)["train"]
print(raw_dataset[0])

Generating train split: 0 examples [00:00, ? examples/s]

{'instruction': 'What is OptimalMD in simple terms?', 'input': '', 'output': 'OptimalMD is a private membership that gives individuals and families low cost access to virtual doctors, mental health support, prescriptions and diagnostic labs for a flat monthly fee, without using traditional health insurance.'}
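
Before fine-tuning, it can help to confirm that every record has the three fields the next step relies on. The sketch below is an illustration under assumptions: `validate_jsonl_lines` and `REQUIRED_KEYS` are hypothetical names, and the field names are taken from the record printed above.

```python
import json

# Field names taken from the first record printed above.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_jsonl_lines(lines):
    """Return (records, problems) for an iterable of JSONL lines."""
    records, problems = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((lineno, f"missing keys: {missing}"))
        elif not str(record["instruction"]).strip() or not str(record["output"]).strip():
            problems.append((lineno, "empty instruction or output"))
        else:
            records.append(record)
    return records, problems


sample = [
    '{"instruction": "What is OptimalMD in simple terms?", "input": "", "output": "A membership."}',
    '{"instruction": "Missing output field", "input": ""}',
]
good, bad = validate_jsonl_lines(sample)
print(len(good), "valid;", bad)
```

To check the real file, the same function can be called on an open file handle, for example `validate_jsonl_lines(open(data_path, encoding="utf-8"))`.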


**Step-5: Preprocess the dataset into text format**

In [6]:
def make_text(example):
    # Build the user message
    if example.get("input"):
        user_content = f"Instruction: {example['instruction']}\n\nInput: {example['input']}"
    else:
        user_content = example["instruction"]

    messages = [
        {
            "role": "system",
            "content": (
                "You are the OptimalMD virtual assistant. "
                "Answer in clear, friendly US English. Explain OptimalMD memberships, "
                "medications, labs, broker program and team info. "
                "You DO NOT give individual medical, legal or financial advice."
            ),
        },
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["output"]},
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = raw_dataset.map(make_text)
print(dataset[0]["text"][:500])

Map:   0%|          | 0/212 [00:00<?, ? examples/s]

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Nov 2025

You are the OptimalMD virtual assistant. Answer in clear, friendly US English. Explain OptimalMD memberships, medications, labs, broker program and team info. You DO NOT give individual medical, legal or financial advice.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is OptimalMD in simple terms?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

OptimalMD is
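
To make the printed layout easier to read, here is a simplified re-implementation of the token structure visible in the output above. It is only a sketch: `render_llama3_chat` is a hypothetical helper, and it deliberately omits the "Cutting Knowledge Date" and "Today Date" lines that the real template injects into the system message. For training and inference, `tokenizer.apply_chat_template` remains the thing to use.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Render messages in the header/eot layout seen in the printed example.

    Simplified: the real chat template also inserts date lines into the
    system message, which this sketch leaves out.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo_messages = [
    {"role": "user", "content": "What is OptimalMD in simple terms?"},
    {"role": "assistant", "content": "OptimalMD is a private membership."},
]
print(render_llama3_chat(demo_messages))
```

With `add_generation_prompt=False` each training example ends with the assistant's `<|eot_id|>`, while `add_generation_prompt=True` leaves an open assistant header for generation.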


**Step-6: Tokenize for causal language modeling**

In [8]:
from transformers import AutoTokenizer

base_model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    use_fast=False,
)

# padding="max_length" needs a pad token; reuse EOS if the tokenizer defines none
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id


# Tokenize each chat-formatted example, truncating or padding to 1024 tokens
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        max_length=1024,
        truncation=True,
        padding="max_length",  # valid because pad_token is set above
    )

tokenized = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names,
)

# Split small eval set (5%)
tokenized = tokenized.train_test_split(test_size=0.05, seed=42)
train_dataset = tokenized["train"]
eval_dataset = tokenized["test"]

print(len(train_dataset), len(eval_dataset))


Map:   0%|          | 0/212 [00:00<?, ? examples/s]

201 11
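
Padding every example to 1024 tokens is simple but can spend most of each batch on pad tokens when the Q&A pairs are short. The sketch below compares fixed padding with per-batch (dynamic) padding. The sequence lengths are made up for illustration and `padded_token_counts` is a hypothetical helper; measure the real lengths before drawing conclusions.

```python
def padded_token_counts(lengths, max_length, batch_size):
    """Return (fixed, dynamic) token counts after padding.

    fixed:   every sequence is padded to max_length.
    dynamic: every sequence is padded to the longest one in its batch.
    """
    clipped = [min(n, max_length) for n in lengths]  # truncation=True
    fixed = len(clipped) * max_length
    dynamic = 0
    for start in range(0, len(clipped), batch_size):
        batch = clipped[start:start + batch_size]
        dynamic += max(batch) * len(batch)
    return fixed, dynamic


# Hypothetical token lengths for four short Q&A examples.
example_lengths = [100, 200, 50, 300]
fixed_tokens, dynamic_tokens = padded_token_counts(
    example_lengths, max_length=1024, batch_size=2
)
print(fixed_tokens, dynamic_tokens)  # 4096 1000
```

One way to get dynamic padding, assuming the collator pads batches of unequal length, is to drop `padding="max_length"` from the tokenizer call and let the data collator pad each batch.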


**Step-7: Add LoRA with PEFT**

In [9]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()


trainable params: 11,272,192 || all params: 1,247,086,592 || trainable%: 0.9039
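
The trainable-parameter count can be reproduced by hand: each adapted linear layer adds `r * (in_features + out_features)` weights (the `lora_A` and `lora_B` matrices). The sketch below assumes the published Llama-3.2-1B dimensions (hidden size 2048, MLP size 8192, key/value projection width 512, 16 layers); these values are assumptions about the model config, not something set in this notebook.

```python
R = 16  # LoRA rank from lora_config

# Assumed Llama-3.2-1B dimensions (taken from the model's published config).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 2048, 8192, 512, 16

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features, out_features, rank):
    """lora_A is (in_features x rank), lora_B is (rank x out_features)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, R) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS
all_params = 1_247_086_592  # "all params" as printed above

print(f"{trainable:,}")                          # 11,272,192
print(f"{100 * trainable / all_params:.4f}%")    # 0.9039%
```

Under these assumptions the result matches the figures reported by `print_trainable_parameters()`.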


**Step-8: Set up Trainer (Vanilla)**

In [10]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

output_dir = "/content/llama32_1b_optimalmd_lora"

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    fp16=True,          # T4 supports fp16 mixed precision but not bf16
    report_to=[],       # disable external loggers such as wandb
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # causal LM
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # only used when evaluation runs, e.g. trainer.evaluate()
    data_collator=data_collator,
)
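
One thing worth checking with this setup: as far as I know, `DataCollatorForLanguageModeling` with `mlm=False` copies `input_ids` into `labels` and replaces every position equal to `pad_token_id` with `-100` so it is ignored by the loss. When the pad token is aliased to the end-of-sequence token, the genuine end-of-sequence token is masked as well, which may make it harder for the model to learn where to stop. The sketch below imitates that masking in plain Python with made-up token ids; `causal_lm_labels` is a hypothetical stand-in, not the library function.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def causal_lm_labels(input_ids, pad_token_id):
    """Imitate the collator's masking: labels equal to pad_token_id are ignored."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Made-up token ids for illustration only.
EOS, PAD = 9, 0

# Case 1: pad token aliased to EOS. One real EOS followed by two pads.
aliased = causal_lm_labels([1, 5, 6, 7, EOS, EOS, EOS], pad_token_id=EOS)
print(aliased)   # [1, 5, 6, 7, -100, -100, -100]  (real EOS is masked too)

# Case 2: a distinct pad token. The real EOS stays in the labels.
distinct = causal_lm_labels([1, 5, 6, 7, EOS, PAD, PAD], pad_token_id=PAD)
print(distinct)  # [1, 5, 6, 7, 9, -100, -100]
```

If this turns out to matter for a given run, one option is to pad with a token other than the end-of-turn token; which token is appropriate depends on the tokenizer's vocabulary.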


**Step-9: Train the LoRA**

In [11]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10   | 3.0133        |
| 20   | 1.4161        |
| 30   | 1.2278        |


TrainOutput(global_step=39, training_loss=1.7218484389476287, metrics={'train_runtime': 266.8906, 'train_samples_per_second': 2.259, 'train_steps_per_second': 0.146, 'total_flos': 3647104434044928.0, 'train_loss': 1.7218484389476287, 'epoch': 3.0})
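
The `global_step=39` figure follows from the dataset size and the batch settings. The sketch below is a back-of-the-envelope calculation assuming a single GPU and that a final partial accumulation group still counts as one optimizer step; the exact rounding may differ between `transformers` versions. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, epochs):
    """Return (steps_per_epoch, total_steps) for single-GPU training."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # Assumption: a trailing partial accumulation group still triggers a step.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return steps_per_epoch, steps_per_epoch * epochs


# 201 training examples, batch size 2, 8 accumulation steps, 3 epochs.
steps_per_epoch, total_steps = optimizer_steps(201, 2, 8, 3)
print(steps_per_epoch, total_steps)  # 13 39
```

With `logging_steps=10`, 39 total steps also explains why the loss table has only three rows (steps 10, 20 and 30).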

**Step-10: Save the LoRA adapter folder**

In [12]:
import os

save_path = "/content/llama32_1b_optimalmd_lora"

trainer.model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print("Saved files:", os.listdir(save_path))

Saved files: ['README.md', 'checkpoint-13', 'checkpoint-39', 'adapter_model.safetensors', 'special_tokens_map.json', 'adapter_config.json', 'chat_template.jinja', 'checkpoint-26', 'tokenizer.json', 'tokenizer_config.json']


**Step-11: Zip & download the adapter folder from Colab**

In [13]:
!cd /content && zip -r llama32_1b_optimalmd_lora.zip llama32_1b_optimalmd_lora

  adding: llama32_1b_optimalmd_lora/ (stored 0%)
  adding: llama32_1b_optimalmd_lora/README.md (deflated 65%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/ (stored 0%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/README.md (deflated 65%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/training_args.bin (deflated 53%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/adapter_model.safetensors (deflated 8%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/special_tokens_map.json (deflated 63%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/rng_state.pth (deflated 26%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/adapter_config.json (deflated 58%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/chat_template.jinja (deflated 71%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/trainer_state.json (deflated 55%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/optimizer.pt (deflated 8%)
  adding: llama32_1b_optimalmd_lora/checkpoint-13/scaler.pt (deflated 64%)
  add
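
The archive above also includes the `checkpoint-*` folders, whose optimizer states are not needed to load the adapter. The sketch below is one way to zip only the top-level adapter and tokenizer files using the standard library. `zip_adapter` is a hypothetical helper, and the demo runs against a temporary directory with placeholder files rather than the real output folder.

```python
import os
import tempfile
import zipfile


def zip_adapter(src_dir, zip_path, skip_prefix="checkpoint-"):
    """Zip src_dir, skipping subdirectories whose name starts with skip_prefix."""
    src_dir = os.path.abspath(src_dir)
    parent = os.path.dirname(src_dir)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(src_dir):
            # Prune checkpoint folders in place so os.walk does not descend.
            dirs[:] = [d for d in dirs if not d.startswith(skip_prefix)]
            for name in files:
                full_path = os.path.join(root, name)
                arcname = os.path.relpath(full_path, parent)
                archive.write(full_path, arcname)
                added.append(arcname)
    return sorted(added)


# Demo on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = os.path.join(tmp, "adapter_demo")
    os.makedirs(os.path.join(adapter_dir, "checkpoint-13"))
    for rel in ("adapter_config.json", os.path.join("checkpoint-13", "optimizer.pt")):
        with open(os.path.join(adapter_dir, rel), "w") as handle:
            handle.write("placeholder")
    demo_names = zip_adapter(adapter_dir, os.path.join(tmp, "adapter_demo.zip"))

print(demo_names)
```

In Colab, the resulting zip can then be fetched with `files.download(...)` from `google.colab`, or through the file browser in the sidebar.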

**Testing**

In [15]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
save_path = "/content/llama32_1b_optimalmd_lora"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=False)

# Reuse EOS as the pad token if none is defined, as was done for training
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

model = PeftModel.from_pretrained(base_model, save_path)
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 2048)
        (layers): ModuleList(
          (0-15): 16 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

In [16]:
device = next(model.parameters()).device

messages = [
    {
        "role": "system",
        "content": (
            "You are the OptimalMD virtual assistant. "
            "Answer in clear, friendly US English. Do not give personal medical advice."
        ),
    },
    {
        "role": "user",
        "content": "Explain the difference between OptimalMD RX, OptimalMD Prime and OptimalMD Impact.",
    },
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant header so the model writes the reply
)

inputs = tokenizer(
    prompt,
    return_tensors="pt",
).to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        top_p=0.9,
        temperature=0.2,
    )

generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


system

Cutting Knowledge Date: December 2023
Today Date: 26 Nov 2025

You are the OptimalMD virtual assistant. Answer in clear, friendly US English. Do not give personal medical advice.user

Explain the difference between OptimalMD RX, OptimalMD Prime and OptimalMD Impact.assistant

OptimalMD RX is a virtual clinic that provides access to a wide range of virtual providers for common health issues, including labs, medications and routine care. It is designed for people who need routine care but may not have a primary care provider or insurance that covers virtual visits. OptimalMD Prime is a more comprehensive virtual health program that includes virtual visits, medications, labs and broker support. It is designed for people who need more comprehensive care and want to avoid traditional insurance plans. OptimalMD Impact is a virtual clinic that focuses on urgent care and emergency services, often used for people who need immediate attention but don’t have insurance that covers virtual 
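
The printed text repeats the system and user turns because `generate` returns the prompt tokens followed by the new tokens. A common way to show only the reply is to slice off the prompt length before decoding. The sketch below illustrates the slicing on plain lists with made-up token ids; `new_tokens_only` is a hypothetical helper.

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the echoed prompt and keep only the generated continuation."""
    return output_ids[prompt_length:]


# Made-up ids: three prompt tokens followed by two generated tokens.
prompt_ids = [101, 7, 8]
full_output = prompt_ids + [42, 43]

reply_ids = new_tokens_only(full_output, len(prompt_ids))
print(reply_ids)  # [42, 43]
```

With the tensors in the cell above, the equivalent would be `output_ids[0][inputs["input_ids"].shape[1]:]`, passed to `tokenizer.decode(..., skip_special_tokens=True)`.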

In [17]:
test_questions = [
    "What is OptimalMD in simple terms?",
    "Who is the owner of OptimalMD?",
    "Who trained you?",
    "How do the free labs work for members?",
]

for q in test_questions:
    messages = [
        {"role": "system", "content": "You are the OptimalMD virtual assistant."},
        {"role": "user", "content": q},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            top_p=0.9,
            temperature=0.2,
        )
    print("=" * 80)
    print("Q:", q)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Q: What is OptimalMD in simple terms?
system

Cutting Knowledge Date: December 2023
Today Date: 26 Nov 2025

You are the OptimalMD virtual assistant.user

What is OptimalMD in simple terms?assistant

OptimalMD is a virtual health platform that helps you manage your routine care, medications, labs, broker program and team info all in one place. It’s like having a personal virtual assistant that can answer questions, schedule appointments, send prescriptions and even help you find local doctors and pharmacies. It’s free for most services and is designed to make healthcare easier and more affordable for everyone.


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Q: Who is the owner of OptimalMD?
system

Cutting Knowledge Date: December 2023
Today Date: 26 Nov 2025

You are the OptimalMD virtual assistant.user

Who is the owner of OptimalMD?assistant

OptimalMD is owned by OptimalMD LLC, a US-based healthcare technology company that aims to make high-quality virtual care accessible to everyone. The company was founded by Dr. Eric Schweikart, who is a physician and entrepreneur with a background in healthcare policy and technology. Dr. Schweikart is the CEO and President of OptimalMD.


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Q: Who trained you?
system

Cutting Knowledge Date: December 2023
Today Date: 26 Nov 2025

You are the OptimalMD virtual assistant.user

Who trained you?assistant

OptimalMD was trained on a large dataset of clinical information and algorithms that were developed by the OptimalMD team. The training data includes thousands of medical cases, diagnoses, medications, labs, and treatments. This training allows me to understand the nuances of human health and provide accurate and helpful responses. My training was based on a combination of natural language processing (NLP) and machine learning algorithms that enable me to understand and respond to a wide range of questions and scenarios.
Q: How do the free labs work for members?
system

Cutting Knowledge Date: December 2023
Today Date: 26 Nov 2025

You are the OptimalMD virtual assistant.user

How do the free labs work for members?assistant

The free labs are designed to help members understand common health conditions and symptoms, and to g