# 🩺 Domain-Specific Assistant via LLM Fine-Tuning (Orthopedic Medical Assistant)

**Domain:** Orthopedics / musculoskeletal conditions (fractures, joints, ligaments, rehabilitation)  
**Dataset:** `medalpaca/medical_meadow_medical_flashcards` (Hugging Face)  
**Model:** `TinyLlama/TinyLlama-1.1B-Chat-v1.0` fine-tuned with **QLoRA** (LoRA adapters on a 4-bit quantized base model)

This Colab notebook includes:
- Dataset loading + orthopedic filtering
- Preprocessing + tokenization
- **3 experiments** (hyperparameter tuning) + comparison table
- Evaluation (eval loss + ROUGE-L + qualitative tests)
- Base vs fine-tuned comparison
- Gradio UI demo


## 1) Install dependencies
*(Install libraries for fine-tuning, evaluation, and UI.)*

In [None]:
!pip install -q -U transformers datasets peft accelerate bitsandbytes rouge-score sacrebleu gradio


## 2) Imports + GPU check
*(Import required modules and confirm GPU availability.)*

In [None]:
import os, re, inspect
import torch
import numpy as np
import pandas as pd

from datasets import load_dataset, Dataset, DatasetDict
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig,
    TrainingArguments, Trainer, DataCollatorForLanguageModeling
)
from peft import LoraConfig, TaskType, prepare_model_for_kbit_training, get_peft_model
from rouge_score import rouge_scorer
import gradio as gr

print('Torch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))


Torch: 2.9.0+cu128
CUDA available: True
GPU: Tesla T4


## 3) Project definition (Rubric: Domain alignment)
*(Purpose, users, and why fine-tuning is needed.)*

- **Purpose:** Answer orthopedic/musculoskeletal study questions in a structured flashcard style.
- **Users:** Students/trainees revising fractures, joint injuries, and rehabilitation basics.
- **Why fine-tune:** A general-purpose chat model tends to give generic answers; fine-tuning aligns its outputs with the domain's concise flashcard-style Q&A format.


## 4) Load dataset
*(Load Medical Meadow medical flashcards from Hugging Face.)*

In [None]:
dataset = load_dataset('medalpaca/medical_meadow_medical_flashcards', split='train')
df = pd.DataFrame(dataset)

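# Standardize common column name variants.
# Note: this dataset ships `instruction`, `input`, and `output`, so the first branch
# applies: `output` becomes `response`, `instruction` is kept unchanged, and the
# question text stays in `input` (see the preview in the cell output).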
if 'instruction' in df.columns and 'output' in df.columns:
    df = df.rename(columns={'output': 'response'})
elif 'question' in df.columns and 'answer' in df.columns:
    df = df.rename(columns={'question': 'instruction', 'answer': 'response'})
elif 'input' in df.columns and 'output' in df.columns:
    df = df.rename(columns={'input': 'instruction', 'output': 'response'})

assert {'instruction','response'}.issubset(df.columns), df.columns.tolist()
print('Original dataset size:', len(df))
df.head(3)



Original dataset size: 33955


|   | input | response | instruction |
|---|---|---|---|
| 0 | What is the relationship between very low Mg2+... | Very low Mg2+ levels correspond to low PTH lev... | Answer this question truthfully |
| 1 | What leads to genitourinary syndrome of menopa... | Low estradiol production leads to genitourinar... | Answer this question truthfully |
| 2 | What does low REM sleep latency and experienci... | Low REM sleep latency and experiencing halluci... | Answer this question truthfully |
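
The preview shows that `instruction` holds the same fixed prompt on every row, while the actual question sits in `input`. If the goal is to train on the question text, one option is to derive the training instruction from `input`. The sketch below is a hypothetical helper (`build_instruction` is not part of the notebook), shown on plain dictionaries with made-up rows so that it runs without downloading the dataset:

```python
def build_instruction(row):
    """Return the question text to train on for one flashcard row.

    Hypothetical helper: prefers the `input` column (the question) and
    falls back to `instruction` when `input` is empty or missing.
    """
    question = (row.get("input") or "").strip()
    fallback = (row.get("instruction") or "").strip()
    return question if question else fallback


# Illustrative rows only; they are not taken from the dataset.
rows = [
    {"instruction": "Answer this question truthfully",
     "input": "What is a femur fracture?"},
    {"instruction": "Answer this question truthfully", "input": ""},
]
questions = [build_instruction(r) for r in rows]
print(questions)
```

Applied to the notebook's DataFrame before filtering (assuming neither column contains missing values), this would change the dataset sizes and metrics reported in later cells, so those outputs would need to be regenerated.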


## 5) Filter orthopedic content
*(Narrow to orthopedic/musculoskeletal topics for stronger domain focus.)*

In [None]:
orthopedic_keywords = [
    'fracture','bone','orthopedic','orthopaedic','musculoskeletal',
    'cast','splint','dislocation','sprain','strain','ligament','tendon',
    'joint','cartilage','osteoporosis','arthritis',
    'hip','knee','shoulder','elbow','wrist','ankle','spine','vertebra',
    'femur','tibia','fibula','radius','ulna','rehabilitation','physical therapy'
]

def contains_keywords(text):
    if pd.isna(text) or text is None:
        return False
    t = str(text).lower()
    return any(k in t for k in orthopedic_keywords)

mask = df['instruction'].apply(contains_keywords) | df['response'].apply(contains_keywords)
df_filtered = df[mask].copy().reset_index(drop=True)

print('Filtered dataset size:', len(df_filtered), 'out of', len(df))
print('Percent retained:', round(100*len(df_filtered)/len(df), 2), '%')

# If too small, fall back to full dataset
df_use = df_filtered if len(df_filtered) >= 500 else df.copy()
if len(df_use) != len(df_filtered):
    print('Filtered set < 500 examples; using full dataset instead.')
print('Final dataset size used:', len(df_use))
df_use.head(3)


Filtered dataset size: 2892 out of 33955
Percent retained: 8.52 %
Final dataset size used: 2892


|   | input | response | instruction |
|---|---|---|---|
| 0 | What are the conditions that can be suggested ... | The presence of monosodium urate crystals in j... | Answer this question truthfully |
| 1 | What conditions are suggested by high ESR/CK a... | High ESR/CK and bilateral proximal muscle weak... | Answer this question truthfully |
| 2 | What is β-thalassemia major and how does it af... | β-thalassemia major is a specific type of β-th... | Answer this question truthfully |
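
The filter relies on substring checks (`k in t`), so short keywords can match inside unrelated words, for example `hip` inside "relationship" or `cast` inside "broadcast". A hedged alternative is to match keywords on word boundaries. In the sketch below, `KEYWORDS`, `KEYWORD_RE`, and `matches_keyword` are illustrative names, and the keyword list is shortened:

```python
import re

# Shortened, illustrative keyword list (the notebook's list is longer).
KEYWORDS = ["hip", "cast", "strain", "fracture", "physical therapy"]

# \b anchors each keyword to word boundaries; the optional "s" keeps simple plurals.
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")s?\b",
    flags=re.IGNORECASE,
)

def matches_keyword(text):
    """Return True if any keyword appears as a whole word in `text`."""
    if text is None:
        return False
    return KEYWORD_RE.search(str(text)) is not None

print(matches_keyword("What is the relationship between Mg2+ and PTH?"))  # False
print(matches_keyword("Management of hip fractures in the elderly"))       # True
```

Whole-word matching would likely retain fewer rows than the substring filter, so the 500-example fallback threshold may need revisiting if it is adopted.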


## 6) Preprocess text
*(Clean/normalize text, remove duplicates, and filter very short entries.)*

In [None]:
def clean_text(text):
    if pd.isna(text) or text is None:
        return ''
    text = str(text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)]', '', text)
    return text.strip()

def normalize_text(text):
    if pd.isna(text) or text is None:
        return ''
    text = str(text)
    return (text
            .replace('\u2019', "'")
            .replace('\u201c', '"')
            .replace('\u201d', '"')
            .replace('\u2013', '-')
           )

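def preprocess_dataset(df_in, min_length=10, max_length=512):
    # min_length / max_length are character counts, not token counts.
    df2 = df_in.copy()
    # clean_text runs first, so characters outside its whitelist (e.g. '/', '+',
    # curly quotes, en dashes) are already removed before normalize_text is applied.
    df2['instruction'] = df2['instruction'].apply(lambda x: normalize_text(clean_text(x)))
    df2['response'] = df2['response'].apply(lambda x: normalize_text(clean_text(x)))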
    before = len(df2)
    df2 = df2[(df2['instruction'].str.len() >= min_length) & (df2['response'].str.len() >= min_length)]
    df2 = df2[(df2['instruction'].str.len() <= max_length) & (df2['response'].str.len() <= max_length)]
    df2 = df2.drop_duplicates(subset=['instruction','response']).reset_index(drop=True)
    after = len(df2)
    print('Before preprocessing:', before)
    print('After preprocessing:', after)
    print('Removed:', before - after)
    return df2

df_processed = preprocess_dataset(df_use, min_length=10, max_length=512)
print('\nInstruction length stats:\n', df_processed['instruction'].str.len().describe())
print('\nResponse length stats:\n', df_processed['response'].str.len().describe())
df_processed.head(3)


Before preprocessing: 2892
After preprocessing: 1145
Removed: 1747

Instruction length stats:
 count    1145.0
mean       31.0
std         0.0
min        31.0
25%        31.0
50%        31.0
75%        31.0
max        31.0
Name: instruction, dtype: float64

Response length stats:
 count    1145.000000
mean      161.791266
std       116.375467
min        11.000000
25%        90.000000
50%       120.000000
75%       177.000000
max       511.000000
Name: response, dtype: float64


|   | input | response | instruction |
|---|---|---|---|
| 0 | What are the conditions that can be suggested ... | The presence of monosodium urate crystals in j... | Answer this question truthfully |
| 1 | What conditions are suggested by high ESR/CK a... | High ESRCK and bilateral proximal muscle weakn... | Answer this question truthfully |
| 2 | What is β-thalassemia major and how does it af... | β-thalassemia major is a specific type of β-th... | Answer this question truthfully |
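
In the processed preview, "ESR/CK" has become "ESRCK": the character whitelist in `clean_text` drops `/`, along with `+`, `%`, and quotes, which can carry meaning in clinical text. A possible variant normalizes typographic punctuation first and then applies a wider whitelist. The function name `clean_clinical_text` and the exact set of characters kept are assumptions made for illustration:

```python
import re

# Map typographic punctuation to ASCII before filtering.
PUNCT_MAP = str.maketrans({
    "\u2019": "'", "\u2018": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})

# Assumed whitelist: word characters, whitespace, and punctuation that carries
# meaning in clinical text (slashes, plus signs, percent, quotes, comparisons).
DISALLOWED = re.compile(r"[^\w\s.,!?;:\-()/+%'\"<>=]")

def clean_clinical_text(text):
    """Normalize punctuation, drop disallowed characters, collapse whitespace."""
    if text is None:
        return ""
    text = str(text).translate(PUNCT_MAP)
    text = DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_clinical_text("High  ESR/CK and low Mg2+ \u2013 patient\u2019s labs"))
# High ESR/CK and low Mg2+ - patient's labs
```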


## 7) Train/validation split
*(Shuffle and split into 90% train and 10% validation.)*

In [None]:
df_shuffled = df_processed.sample(frac=1, random_state=42).reset_index(drop=True)
val_size = int(len(df_shuffled) * 0.1)

val_df = df_shuffled.iloc[:val_size].reset_index(drop=True)
train_df = df_shuffled.iloc[val_size:].reset_index(drop=True)

print(f'Train: {len(train_df)} examples')
print(f'Validation: {len(val_df)} examples')


Train: 1031 examples
Validation: 114 examples


## 8) Tokenizer + formatting
*(Load tokenizer and format instruction-response pairs in Alpaca style.)*

In [None]:
model_name = 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
print('Tokenizer loaded:', model_name)
print('Vocab size:', tokenizer.vocab_size)

def format_instruction_response(instruction, response):
    return (
        'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
        '### Instruction:\n'
        f'{instruction}\n\n'
        '### Response:\n'
        f'{response}'
    )





Tokenizer loaded: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Vocab size: 32000


## 9) Convert to HF Dataset + tokenize
*(Tokenize and set labels for causal LM training.)*

In [None]:
def tokenize_function(examples, max_length=512):
    texts = [format_instruction_response(i, r) for i, r in zip(examples['instruction'], examples['response'])]
    tok = tokenizer(texts, truncation=True, padding='max_length', max_length=max_length)
    tok['labels'] = tok['input_ids'].copy()
    return tok

train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

train_tokenized = train_dataset.map(lambda x: tokenize_function(x, 512), batched=True, remove_columns=train_dataset.column_names)
val_tokenized = val_dataset.map(lambda x: tokenize_function(x, 512), batched=True, remove_columns=val_dataset.column_names)

dataset_dict = DatasetDict({'train': train_tokenized, 'validation': val_tokenized})
print('Tokenized train:', len(dataset_dict['train']))
print('Tokenized validation:', len(dataset_dict['validation']))



Tokenized train: 1031
Tokenized validation: 114


## 10) TrainingArguments compatibility helper
*(Auto-handle `eval_strategy` vs `evaluation_strategy` based on your transformers version.)*

In [None]:
def make_training_args(**kwargs):
    sig = inspect.signature(TrainingArguments.__init__)
    params = sig.parameters.keys()
    if 'evaluation_strategy' in params and 'eval_strategy' in kwargs:
        kwargs['evaluation_strategy'] = kwargs.pop('eval_strategy')
    if 'evaluation_strategy' not in params and 'evaluation_strategy' in kwargs:
        kwargs['eval_strategy'] = kwargs.pop('evaluation_strategy')
    return TrainingArguments(**kwargs)

print('Supports evaluation_strategy:', 'evaluation_strategy' in inspect.signature(TrainingArguments.__init__).parameters)


Supports evaluation_strategy: False
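
To see how the helper behaves under both naming conventions without installing two library versions, the same idea can be exercised against stand-in functions. This is a minimal sketch: `adapt_kwargs`, `old_api`, and `new_api` are hypothetical names, and the two stubs only imitate an older and a newer signature.

```python
import inspect

def adapt_kwargs(func, kwargs, old="evaluation_strategy", new="eval_strategy"):
    """Rename `old`/`new` in kwargs to whichever name `func` accepts."""
    params = inspect.signature(func).parameters
    out = dict(kwargs)
    if old in params and new in out:
        out[old] = out.pop(new)
    elif old not in params and old in out:
        out[new] = out.pop(old)
    return out

# Stand-ins for an older and a newer signature (hypothetical).
def old_api(output_dir, evaluation_strategy="no"): ...
def new_api(output_dir, eval_strategy="no"): ...

print(adapt_kwargs(old_api, {"output_dir": "x", "eval_strategy": "steps"}))
print(adapt_kwargs(new_api, {"output_dir": "x", "evaluation_strategy": "steps"}))
```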


## 11) Load base model (4-bit) for QLoRA
*(Load TinyLlama with 4-bit NF4 quantization.)*

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)

def load_base_model():
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map='auto',
        trust_remote_code=True
    )

base_model = load_base_model()
print('Base model loaded.')



Base model loaded.


## 12) Inference helper
*(Generate answers for qualitative comparisons and ROUGE evaluation.)*

In [None]:
def generate_answer(model, prompt, max_new_tokens=128, temperature=0.7):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)
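
`generate_answer` sends the raw prompt to the model and decodes the whole sequence, so the returned text begins with the prompt itself. Training examples, by contrast, were wrapped in the Alpaca-style template from section 8. One way to line the two up is to format the question with the same template at inference time and keep only the text after the response marker. The two string helpers below are a sketch; `build_inference_prompt` and `extract_response` are hypothetical names, and the decoder output is simulated:

```python
PROMPT_HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)
RESPONSE_MARKER = "### Response:\n"

def build_inference_prompt(question):
    """Format a question the way training examples were formatted,
    leaving the response section empty for the model to fill in."""
    return f"{PROMPT_HEADER}### Instruction:\n{question}\n\n{RESPONSE_MARKER}"

def extract_response(decoded_text):
    """Keep only the text after the last response marker."""
    return decoded_text.split(RESPONSE_MARKER)[-1].strip()

prompt = build_inference_prompt("What is a femur fracture?")
# Simulated decoder output: the prompt followed by a generated answer.
decoded = prompt + "A break in the thigh bone."
print(extract_response(decoded))  # A break in the thigh bone.
```

In the notebook this would amount to calling `generate_answer(model, build_inference_prompt(q))` and passing the result through `extract_response`; whether it improves answers for this model is untested here.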


## 13) Apply LoRA
*(Attach LoRA adapters to attention projection layers.)*

In [None]:
def apply_lora(model, r=8, alpha=16, dropout=0.1):
    lora_config = LoraConfig(
        r=r,
        lora_alpha=alpha,
        target_modules=['q_proj','k_proj','v_proj','o_proj'],
        lora_dropout=dropout,
        bias='none',
        task_type=TaskType.CAUSAL_LM
    )
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    return model


## 14) Data collator
*(Prepare batches for causal LM.)*

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
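
As far as I understand the collator's behavior with `mlm=False`, it rebuilds `labels` from `input_ids` and sets every position equal to the pad token id to `-100`, the index that PyTorch's cross-entropy loss ignores. Because the tokenizer cell falls back to the EOS token as the pad token when none is defined, masking by token id would then also hide genuine EOS tokens. A sketch of masking by attention mask instead, on plain Python lists (`mask_padding_labels` is a hypothetical helper and the token ids are toy values):

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]

# Toy example: token id 2 plays the role of both EOS and PAD.
input_ids      = [1, 450, 9441, 2, 2, 2]
attention_mask = [1, 1,   1,    1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [1, 450, 9441, 2, -100, -100]
```

The first `2` is a real end-of-sequence token and stays in the labels; only the two padded positions are ignored.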

## 15) Experiment runner
*(Train + evaluate one experiment and return metrics.)*

In [None]:
def run_experiment(exp_name, lr, epochs, lora_r, lora_alpha, max_steps=None):
    print('\n====================', exp_name, '====================')
    print('lr=', lr, 'epochs=', epochs, 'lora_r=', lora_r, 'lora_alpha=', lora_alpha)

    model = load_base_model()
    model = apply_lora(model, r=lora_r, alpha=lora_alpha, dropout=0.1)

    training_args = make_training_args(
        output_dir=f'./{exp_name}',
        num_train_epochs=epochs,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=lr,
        warmup_steps=100,
        logging_steps=25,
        save_steps=250,
        eval_strategy='steps',
        eval_steps=250,
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model='eval_loss',
        fp16=True,
        report_to='none',
        remove_unused_columns=False,
        max_steps=(max_steps if max_steps is not None else -1),
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset_dict['train'],
        eval_dataset=dataset_dict['validation'],
        data_collator=data_collator,
    )

    train_out = trainer.train()
    eval_out = trainer.evaluate()

    metrics = {
        'experiment': exp_name,
        'lr': lr,
        'epochs': epochs,
        'lora_r': lora_r,
        'lora_alpha': lora_alpha,
        'train_runtime_sec': train_out.metrics.get('train_runtime', None),
        'train_loss': train_out.metrics.get('train_loss', None),
        'eval_loss': eval_out.get('eval_loss', None),
    }
    print('Eval metrics:', eval_out)
    return model, metrics


## 16) Run 3 experiments (Rubric: hyperparameter tuning)
*(Runs baseline, lower LR, and higher LoRA rank.)*

In [None]:
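# TIP: For a quick smoke test, pass a small max_steps (e.g. 20) in each call; with this
# dataset and batch setup, a full 2-epoch run is only on the order of 130 optimizer steps.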
ft_model_1, metrics_1 = run_experiment('exp1_baseline_lr2e-4_r8', lr=2e-4, epochs=2, lora_r=8,  lora_alpha=16, max_steps=None)
ft_model_2, metrics_2 = run_experiment('exp2_lower_lr1e-4_r8',   lr=1e-4, epochs=2, lora_r=8,  lora_alpha=16, max_steps=None)
ft_model_3, metrics_3 = run_experiment('exp3_higher_rank_lr2e-4_r16', lr=2e-4, epochs=2, lora_r=16, lora_alpha=32, max_steps=None)

metrics_df = pd.DataFrame([metrics_1, metrics_2, metrics_3])
metrics_df



lr= 0.0002 epochs= 2 lora_r= 8 lora_alpha= 16


trainable params: 2,252,800 || all params: 1,102,301,184 || trainable%: 0.2044


Step,Training Loss,Validation Loss


Eval metrics: {'eval_loss': 0.8607366681098938, 'eval_runtime': 11.3332, 'eval_samples_per_second': 10.059, 'eval_steps_per_second': 2.559, 'epoch': 2.0}

lr= 0.0001 epochs= 2 lora_r= 8 lora_alpha= 16


trainable params: 2,252,800 || all params: 1,102,301,184 || trainable%: 0.2044


Step,Training Loss,Validation Loss


Eval metrics: {'eval_loss': 0.8894801139831543, 'eval_runtime': 11.2663, 'eval_samples_per_second': 10.119, 'eval_steps_per_second': 2.574, 'epoch': 2.0}

lr= 0.0002 epochs= 2 lora_r= 16 lora_alpha= 32


trainable params: 4,505,600 || all params: 1,104,553,984 || trainable%: 0.4079


Step,Training Loss,Validation Loss


Eval metrics: {'eval_loss': 0.8471505641937256, 'eval_runtime': 11.2903, 'eval_samples_per_second': 10.097, 'eval_steps_per_second': 2.569, 'epoch': 2.0}


|   | experiment | lr | epochs | lora_r | lora_alpha | train_runtime_sec | train_loss | eval_loss |
|---|---|---|---|---|---|---|---|---|
| 0 | exp1_baseline_lr2e-4_r8 | 0.0002 | 2 | 8 | 16 | 666.1688 | 1.336835 | 0.860737 |
| 1 | exp2_lower_lr1e-4_r8 | 0.0001 | 2 | 8 | 16 | 666.9148 | 1.489881 | 0.88948 |
| 2 | exp3_higher_rank_lr2e-4_r16 | 0.0002 | 2 | 16 | 32 | 668.4933 | 1.255071 | 0.847151 |
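
When reading these numbers it helps to estimate how many optimizer steps each run actually takes. The arithmetic below is a rough sketch that assumes a single GPU and uses the values from the split and experiment-runner cells; the exact rounding of the last partial accumulation window depends on the transformers version.

```python
import math

train_examples = 1031      # from the train/validation split cell
per_device_batch = 4
grad_accum_steps = 4
epochs = 2

effective_batch = per_device_batch * grad_accum_steps
# Approximate optimizer steps per epoch (assumes one GPU).
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

warmup_steps, eval_steps, save_steps = 100, 250, 250
print("effective batch size:", effective_batch)       # 16
print("approx. total optimizer steps:", total_steps)  # 130
print("warmup share of training:", round(warmup_steps / total_steps, 2))
print("step-based eval reached:", total_steps >= eval_steps)
```

Under this estimate the 100 warmup steps cover most of each run, and the 250-step evaluation interval is never reached during training. That would be consistent with the empty `Step,Training Loss,Validation Loss` tables in the output, and it means the reported `eval_loss` comes from the final `trainer.evaluate()` call. Smaller `warmup_steps` and `eval_steps` values may be worth trying.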


## 17) Select best experiment
*(Choose best model by lowest validation loss.)*

In [None]:
metrics_df['eval_loss'] = metrics_df['eval_loss'].astype(float)
best_idx = metrics_df['eval_loss'].idxmin()
best_exp = metrics_df.loc[best_idx, 'experiment']
print('Best experiment:', best_exp)

best_model = {
    'exp1_baseline_lr2e-4_r8': ft_model_1,
    'exp2_lower_lr1e-4_r8': ft_model_2,
    'exp3_higher_rank_lr2e-4_r16': ft_model_3,
}[best_exp]

metrics_df.sort_values('eval_loss')


Best experiment: exp3_higher_rank_lr2e-4_r16


|   | experiment | lr | epochs | lora_r | lora_alpha | train_runtime_sec | train_loss | eval_loss |
|---|---|---|---|---|---|---|---|---|
| 2 | exp3_higher_rank_lr2e-4_r16 | 0.0002 | 2 | 16 | 32 | 668.4933 | 1.255071 | 0.847151 |
| 0 | exp1_baseline_lr2e-4_r8 | 0.0002 | 2 | 8 | 16 | 666.1688 | 1.336835 | 0.860737 |
| 1 | exp2_lower_lr1e-4_r8 | 0.0001 | 2 | 8 | 16 | 666.9148 | 1.489881 | 0.88948 |


## 18) ROUGE-L evaluation (Rubric: performance metrics)
*(Compute ROUGE-L on a validation sample.)*

In [None]:
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
sample_n = min(25, len(val_df))
val_sample = val_df.sample(n=sample_n, random_state=42).reset_index(drop=True)

rouge_scores = []
for i in range(sample_n):
    prompt = val_sample.loc[i, 'instruction']
    reference = val_sample.loc[i, 'response']
    pred = generate_answer(best_model, prompt, max_new_tokens=128, temperature=0.7)
    score = scorer.score(reference, pred)['rougeL'].fmeasure
    rouge_scores.append(score)

print('ROUGE-L mean:', float(np.mean(rouge_scores)))
print('ROUGE-L min/max:', float(np.min(rouge_scores)), '/', float(np.max(rouge_scores)))


ROUGE-L mean: 0.07339176082904339
ROUGE-L min/max: 0.0 / 0.189873417721519
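
Two details may help explain the low scores. The prompt is read from the `instruction` column, which the length statistics in section 6 show to be the same 31-character string on every row. In addition, `generate_answer` returns the prompt together with the generated text, so each prediction contains tokens that are not part of the reference. The sketch below is a simplified ROUGE-L F-measure based on the longest common subsequence; it uses lowercased whitespace tokens and no stemming, so its values will differ from those of `rouge_score`. The example strings are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, prediction):
    """Simplified ROUGE-L F-measure on lowercased whitespace tokens."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    if not ref or not pred:
        return 0.0
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "a femur fracture is a break in the thigh bone"
answer = "a femur fracture is a break in the thigh bone"
echoed = "what is a femur fracture ? " + answer  # prompt echoed before the answer

print(rouge_l_f1(reference, answer))
print(rouge_l_f1(reference, echoed))
```

The echoed version scores lower even though the answer is identical, because the extra prompt tokens reduce precision.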


## 19) Base vs fine-tuned qualitative comparison
*(Compare outputs on the same prompts.)*

In [None]:
test_prompts = [
    'What is a femur fracture?',
    'How is a wrist fracture typically treated?',
    'What is the difference between a sprain and a strain?',
    'Explain osteoporosis in simple terms.',
    'What are common symptoms of arthritis?'
]

for p in test_prompts:
    print('\n' + '='*80)
    print('PROMPT:', p)
    print('\n--- Base model ---')
    print(generate_answer(base_model, p, max_new_tokens=128, temperature=0.7))
    print('\n--- Fine-tuned model ---')
    print(generate_answer(best_model, p, max_new_tokens=128, temperature=0.7))



PROMPT: What is a femur fracture?

--- Base model ---
What is a femur fracture? How does it differ from a tibia fracture?

--- Fine-tuned model ---
What is a femur fracture?

Answer: A femur fracture is a break in the bone that extends from the head of the femur (thigh bone) to the knee joint. It is commonly caused by falling from a height or by a direct blow to the body.

### 3. What is a humerus fracture?

Answer: A humerus fracture is a break in the bone that extends from the head of the humerus (upper arm bone) to the elbow joint. It is commonly caused by falling from a height or by a direct blow to

PROMPT: How is a wrist fracture typically treated?

--- Base model ---
How is a wrist fracture typically treated?

--- Fine-tuned model ---
How is a wrist fracture typically treated?

PROMPT: What is the difference between a sprain and a strain?

--- Base model ---
What is the difference between a sprain and a strain? How can you prevent both of these injuries?

--- Fine-tuned model -

## 20) Save best LoRA adapter
*(Save adapter weights + tokenizer.)*

In [None]:
save_dir = './best_lora_adapter'
os.makedirs(save_dir, exist_ok=True)
best_model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
print('Saved adapter + tokenizer to:', save_dir)


Saved adapter + tokenizer to: ./best_lora_adapter


## 21) Gradio UI
*(Launch an interactive UI for users.)*

In [None]:
import gradio as gr

def assistant(prompt, temperature=0.7, max_new_tokens=160):
    return generate_answer(best_model, prompt, max_new_tokens=max_new_tokens, temperature=temperature)

demo = gr.Interface(
    fn=assistant,
    inputs=[
        gr.Textbox(lines=3, label="Ask an orthopedic question"),
        gr.Slider(0.1, 1.5, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(32, 256, value=160, step=16, label="Max new tokens"),
    ],
    outputs="text",
    title="🦴 Orthopedic Medical Study Assistant (TinyLlama + QLoRA)",
    description="Fine-tuned on Medical Meadow flashcards with orthopedic filtering. Dataset: medalpaca/medical_meadow_medical_flashcards.",
    examples=[
        ["What is a femur fracture?", 0.7, 160],
        ["How is a ligament tear treated?", 0.7, 160],
        ["Explain osteoporosis simply.", 0.7, 160],
    ],
    flagging_mode="never"   # ✅ replacement for allow_flagging
)

demo.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://5e1f02178c3b89578a.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [None]:
!pip -q install -U huggingface_hub

from huggingface_hub import login
login()  # paste your HF token with "write" access



In [None]:
from huggingface_hub import create_repo, upload_folder

HF_USERNAME = "Liliane078"
MODEL_REPO_NAME = "ortho-lora-exp3"
MODEL_REPO_ID = f"{HF_USERNAME}/{MODEL_REPO_NAME}"

create_repo(repo_id=MODEL_REPO_ID, repo_type="model", exist_ok=True)

upload_folder(
    repo_id=MODEL_REPO_ID,
    repo_type="model",
    folder_path="./best_lora_adapter",
    path_in_repo="."
)

print("✅ Uploaded adapter repo:", MODEL_REPO_ID)



✅ Uploaded adapter repo: Liliane078/ortho-lora-exp3


## 22) Visualize experiment comparison (Rubric: Analysis)
*(Create charts comparing the three experiments.)*

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Eval Loss Comparison
axes[0].bar(metrics_df['experiment'], metrics_df['eval_loss'], color=['#1f77b4', '#ff7f0e', '#2ca02c'])
axes[0].set_xlabel('Experiment')
axes[0].set_ylabel('Validation Loss')
axes[0].set_title('Validation Loss by Experiment')
axes[0].tick_params(axis='x', rotation=45)

# Plot 2: Training Loss
axes[1].bar(metrics_df['experiment'], metrics_df['train_loss'], color=['#1f77b4', '#ff7f0e', '#2ca02c'])
axes[1].set_xlabel('Experiment')
axes[1].set_ylabel('Training Loss')
axes[1].set_title('Training Loss by Experiment')
axes[1].tick_params(axis='x', rotation=45)

# Plot 3: Training Time
axes[2].bar(metrics_df['experiment'], metrics_df['train_runtime_sec'], color=['#1f77b4', '#ff7f0e', '#2ca02c'])
axes[2].set_xlabel('Experiment')
axes[2].set_ylabel('Runtime (seconds)')
axes[2].set_title('Training Time by Experiment')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('experiment_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("📊 Best model:", best_exp, "with eval_loss =", metrics_df.loc[best_idx, 'eval_loss'])

## 23) Save experiment results
*(Export metrics to CSV and JSON for documentation.)*

In [None]:
import json

# Save metrics to CSV
metrics_df.to_csv('experiment_metrics.csv', index=False)
print('✅ Saved metrics to experiment_metrics.csv')

# Save detailed results to JSON
results = {
    'project': 'Orthopedic Medical Assistant Fine-Tuning',
    'base_model': model_name,
    'dataset': 'medalpaca/medical_meadow_medical_flashcards',
    'dataset_size': {
        'original': len(df),
        'filtered': len(df_filtered),
        'final': len(df_use),
        'train': len(train_df),
        'validation': len(val_df)
    },
    'experiments': metrics_df.to_dict('records'),
    'best_experiment': best_exp,
    'rouge_l_mean': float(np.mean(rouge_scores)),
    'rouge_l_std': float(np.std(rouge_scores)),
    'training_date': '2026-02-18'
}

with open('experiment_results.json', 'w') as f:
    json.dump(results, f, indent=2)

print('✅ Saved detailed results to experiment_results.json')
print('\n📁 Files ready for submission:')
print('  - experiment_metrics.csv')
print('  - experiment_results.json')
print('  - experiment_comparison.png')
print('  - best_lora_adapter/ (model weights)')

## 24) Compute perplexity (Rubric: Additional metrics)
*(Calculate perplexity on validation set for base and fine-tuned models.)*

In [None]:
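def calculate_perplexity(model, dataset, num_samples=50, seed=42):
    """Calculate token-level perplexity on a random sample of the dataset."""
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.eval()

    total_loss = 0.0
    num_tokens = 0

    # Fixed seed so that base and fine-tuned models are scored on the same examples
    rng = np.random.default_rng(seed)
    sample_indices = rng.choice(len(dataset), size=min(num_samples, len(dataset)), replace=False)

    with torch.no_grad():
        for idx in sample_indices:
            sample = dataset[int(idx)]
            input_ids = torch.tensor([sample['input_ids']]).to(device)
            attention_mask = torch.tensor([sample['attention_mask']]).to(device)

            # Ignore padded positions in the loss. The stored labels are a plain copy of
            # input_ids, so without this mask the loss would be averaged over padding too.
            labels = input_ids.clone()
            labels[attention_mask == 0] = -100

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss  # mean over scored (non-ignored) target tokens

            # Labels are shifted by one position inside the model, so count scored targets
            num_valid_tokens = (labels[:, 1:] != -100).sum().item()

            total_loss += loss.item() * num_valid_tokens
            num_tokens += num_valid_tokens

    avg_loss = total_loss / num_tokens
    perplexity = np.exp(avg_loss)
    return perplexity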

print("Computing perplexity on validation set...")
print("(This may take a few minutes)\n")

# Base model perplexity
print("🔵 Base model perplexity:")
base_ppl = calculate_perplexity(base_model, val_tokenized, num_samples=50)
print(f"   Perplexity: {base_ppl:.2f}\n")

# Fine-tuned model perplexity
print("🟢 Fine-tuned model perplexity:")
ft_ppl = calculate_perplexity(best_model, val_tokenized, num_samples=50)
print(f"   Perplexity: {ft_ppl:.2f}\n")

# Calculate improvement
improvement = ((base_ppl - ft_ppl) / base_ppl) * 100
print(f"📈 Perplexity improvement: {improvement:.2f}%")

if ft_ppl < base_ppl:
    print("✅ Fine-tuned model shows better perplexity (lower is better)")
else:
    print("⚠️ Base model has better perplexity (may indicate overfitting)")
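
Perplexity here is the exponential of the average per-token negative log-likelihood. The small sketch below shows the token-weighted aggregation on toy numbers and relates it to the validation loss reported for the best experiment; `corpus_perplexity` is an illustrative name, not a library function.

```python
import math

def corpus_perplexity(mean_losses, token_counts):
    """Token-weighted perplexity from per-sequence mean losses.

    mean_losses[i] is the average negative log-likelihood (in nats) over
    the token_counts[i] scored tokens of sequence i.
    """
    total_nll = sum(l * n for l, n in zip(mean_losses, token_counts))
    total_tokens = sum(token_counts)
    return math.exp(total_nll / total_tokens)

# Toy check: a model that is uniformly unsure among 4 tokens has perplexity 4.
toy_ppl = corpus_perplexity([math.log(4), math.log(4)], [10, 30])
print(toy_ppl)

# Relation to the reported eval loss of the best run: exp(0.847151).
print(round(math.exp(0.847151), 2))  # 2.33
```

Weighting by token count matters when sequences differ in length; a plain mean of per-sequence losses would give short flashcards the same weight as long ones.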

In [None]:
SPACE_REPO_NAME = "orthopedic-med-assistant"
SPACE_REPO_ID = f"{HF_USERNAME}/{SPACE_REPO_NAME}"

create_repo(
    repo_id=SPACE_REPO_ID,
    repo_type="space",
    space_sdk="gradio",
    exist_ok=True
)

print("✅ Created Space repo:", SPACE_REPO_ID)


✅ Created Space repo: Liliane078/orthopedic-med-assistant


In [None]:
import os

SPACE_DIR = "./hf_space_app"
os.makedirs(SPACE_DIR, exist_ok=True)

app_py = f"""
import torch
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
LORA_REPO  = "{MODEL_REPO_ID}"

SYSTEM_PROMPT = (
    "You are an orthopedic medical study assistant. "
    "Explain clearly in simple medical study language. "
    "Always add: 'This is for learning purposes only, not medical advice.'"
)

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(LORA_REPO)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto" if device == "cuda" else None
)

model = PeftModel.from_pretrained(base_model, LORA_REPO)
model.eval()

def generate_answer(prompt, temperature=0.7, max_new_tokens=160):
    full_prompt = f"{{SYSTEM_PROMPT}}\\n\\nUser: {{prompt}}\\nAssistant:"
    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=int(max_new_tokens),
            do_sample=True,
            temperature=float(temperature),
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )

    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = text.split("Assistant:")[-1].strip()
    return answer

demo = gr.Interface(
    fn=generate_answer,
    inputs=[
        gr.Textbox(lines=3, label="Ask an orthopedic question"),
        gr.Slider(0.1, 1.5, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(32, 256, value=160, step=16, label="Max new tokens"),
    ],
    outputs="text",
    title="🦴 Orthopedic Medical Study Assistant (TinyLlama + QLoRA)",
    description="Fine-tuned orthopedic-focused study assistant. For learning only, not medical advice.",
    examples=[
        ["What is a femur fracture?", 0.7, 160],
        ["How is a ligament tear treated?", 0.7, 160],
        ["Explain osteoporosis simply.", 0.7, 160],
    ],
    flagging_mode="never"
)

demo.launch()
""".strip()

req_txt = """
torch
transformers
peft
accelerate
safetensors
gradio
""".strip()

with open(f"{SPACE_DIR}/app.py", "w") as f:
    f.write(app_py)

with open(f"{SPACE_DIR}/requirements.txt", "w") as f:
    f.write(req_txt)

print("✅ Space files written to:", SPACE_DIR)


✅ Space files written to: ./hf_space_app
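
The `app.py` source is produced from an f-string, which is why the template writes `{{SYSTEM_PROMPT}}` and `\\n`: doubled braces become literal braces in the generated file, and the doubled backslash becomes a single one, while `{MODEL_REPO_ID}` is filled in at generation time. A minimal sketch of the same mechanism, using a made-up repo id and function:

```python
repo_id = "user/example-adapter"  # hypothetical repo id, for illustration only

template = f"""
LORA_REPO = "{repo_id}"

def greet(name):
    return f"Hello, {{name}}!\\n"
""".strip()

print(template)

# The generated source is valid Python and keeps its own placeholder.
namespace = {}
exec(template, namespace)
print(repr(namespace["greet"]("Ada")))  # 'Hello, Ada!\n'
```

Printing the generated source before uploading it is a cheap way to confirm that only the intended placeholders were substituted.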


In [None]:
upload_folder(
    repo_id=SPACE_REPO_ID,
    repo_type="space",
    folder_path=SPACE_DIR,
    path_in_repo="."
)

print("✅ Deployed Space:", SPACE_REPO_ID)


✅ Deployed Space: Liliane078/orthopedic-med-assistant
