# Healthcare Assistant via LLM Fine-Tuning

This notebook implements a domain-specific healthcare assistant by fine-tuning TinyLlama-1.1B using LoRA (Low-Rank Adaptation) on medical question-answer pairs.

## Project Overview
- **Domain**: Healthcare
- **Model**: TinyLlama-1.1B-Chat
- **Dataset**: Medical Meadow Medical Flashcards
- **Fine-tuning Method**: LoRA (Parameter-Efficient Fine-Tuning)
- **Deployment**: Gradio Web Interface

## Navigation
1. Environment Setup
2. Dataset Loading and Exploration
3. Data Preprocessing
4. Model Configuration with LoRA
5. Training with Hyperparameter Experiments
6. Evaluation and Metrics
7. Model Comparison (Base vs Fine-tuned)
8. Deployment Interface

## 1. Environment Setup

Installing required libraries for fine-tuning and deployment.

In [None]:
import sys
import subprocess

def install_packages():
    packages = [
        'transformers',
        'datasets',
        'peft',
        'trl',
        'accelerate',
        'bitsandbytes',
        'gradio',
        'rouge-score',
        'sacrebleu',
        'sentencepiece',
        'protobuf',
        'torch',
    ]

    for package in packages:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

    print("All packages installed successfully")

install_packages()

All packages installed successfully


## 2. Dataset Loading and Exploration

Loading the Medical Meadow Medical Flashcards dataset from Hugging Face.

In [None]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("medalpaca/medical_meadow_medical_flashcards")

print(f"Dataset structure: {dataset}")
print(f"\nDataset size: {len(dataset['train'])} examples")
print(f"\nFirst example:")
print(dataset['train'][0])
print(f"\nDataset columns: {dataset['train'].column_names}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

medical_meadow_wikidoc_medical_flashcard(…):   0%|          | 0.00/17.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/33955 [00:00<?, ? examples/s]

Dataset structure: DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruction'],
        num_rows: 33955
    })
})

Dataset size: 33955 examples

First example:
{'input': 'What is the relationship between very low Mg2+ levels, PTH levels, and Ca2+ levels?', 'output': 'Very low Mg2+ levels correspond to low PTH levels which in turn results in low Ca2+ levels.', 'instruction': 'Answer this question truthfully'}

Dataset columns: ['input', 'output', 'instruction']


In [None]:
sample_df = pd.DataFrame(dataset['train'][:10])
print(sample_df.head())
print(f"\nDataset columns: {sample_df.columns.tolist()}")

                                               input  \
0  What is the relationship between very low Mg2+...   
1  What leads to genitourinary syndrome of menopa...   
2  What does low REM sleep latency and experienci...   
3  What are some possible causes of low PTH and h...   
4  How does the level of anti-müllerian hormone r...   

                                              output  \
0  Very low Mg2+ levels correspond to low PTH lev...   
1  Low estradiol production leads to genitourinar...   
2  Low REM sleep latency and experiencing halluci...   
3  PTH-independent hypercalcemia, which can be ca...   
4  The level of anti-müllerian hormone is directl...   

                       instruction  
0  Answer this question truthfully  
1  Answer this question truthfully  
2  Answer this question truthfully  
3  Answer this question truthfully  
4  Answer this question truthfully  

Dataset columns: ['input', 'output', 'instruction']


## 3. Data Preprocessing

Preprocessing involves:
- Formatting data into instruction-response templates
- Tokenization with appropriate special tokens
- Sequence length management
- Train-test split
- Data cleaning and normalization

In [None]:
import re

def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

def format_instruction(example):
    instruction = clean_text(example.get('input', example.get('instruction', '')))
    response = clean_text(example.get('output', example.get('response', example.get('answer', ''))))

    if not instruction or not response:
        return {'text': None}

    formatted_text = f"""<|user|>
{instruction}
<|assistant|>
{response}"""

    return {'text': formatted_text}

print("Sample formatted example:")
sample = format_instruction(dataset['train'][0])
print(sample['text'] if sample else "No valid example")

Sample formatted example:
<|user|>
What is the relationship between very low Mg2+ levels, PTH levels, and Ca2+ levels?
<|assistant|>
Very low Mg2+ levels correspond to low PTH levels which in turn results in low Ca2+ levels.


In [None]:
processed_dataset = dataset['train'].map(
    format_instruction,
    remove_columns=dataset['train'].column_names
)

processed_dataset = processed_dataset.filter(lambda x: x['text'] is not None)

train_size = 3000
processed_dataset = processed_dataset.select(range(min(train_size, len(processed_dataset))))

split_dataset = processed_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset['train']
eval_dataset = split_dataset['test']

print(f"Training samples: {len(train_dataset)}")
print(f"Evaluation samples: {len(eval_dataset)}")

Map:   0%|          | 0/33955 [00:00<?, ? examples/s]

Filter:   0%|          | 0/33955 [00:00<?, ? examples/s]

Training samples: 2700
Evaluation samples: 300


## 4. Model Configuration with LoRA

Setting up TinyLlama with parameter-efficient fine-tuning using LoRA.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

print(f"Model loaded: {model_name}")
print(f"Model parameters: {model.num_parameters() / 1e6:.2f}M")

config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]



special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Model loaded: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Model parameters: 1100.05M


In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())

print(f"Trainable parameters: {trainable_params / 1e6:.2f}M")
print(f"All parameters: {all_params / 1e6:.2f}M")
print(f"Percentage trainable: {100 * trainable_params / all_params:.2f}%")

Trainable parameters: 4.51M
All parameters: 620.11M
Percentage trainable: 0.73%


## 5. Training with Hyperparameter Experiments

We will conduct multiple training runs with different hyperparameters to document their impact on performance.

### Experiment Tracking Table

| Experiment | Learning Rate | Batch Size | Epochs | LoRA Rank | Training Time | GPU Memory | Final Loss | Notes |
|------------|--------------|------------|--------|-----------|---------------|------------|------------|-------|
| Exp 1 | 2e-4 | 4 | 1 | 16 | TBD | TBD | TBD | Baseline configuration |
| Exp 2 | 5e-5 | 4 | 2 | 16 | TBD | TBD | TBD | Lower learning rate, more epochs |
| Exp 3 | 2e-4 | 2 | 1 | 8 | TBD | TBD | TBD | Smaller batch, lower rank |
| Exp 4 | 1e-4 | 4 | 1 | 32 | TBD | TBD | TBD | Higher rank for more capacity |

In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
import time

def run_training_experiment(
    experiment_name,
    learning_rate,
    per_device_batch_size,
    num_epochs,
    output_dir
):
    print(f"\n{'='*60}")
    print(f"Running {experiment_name}")
    print(f"{'='*60}")

    start_time = time.time()

    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")

    tokenized_train = train_dataset.map(tokenize_function, batched=True, remove_columns=["text"])
    tokenized_eval = eval_dataset.map(tokenize_function, batched=True, remove_columns=["text"])

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=per_device_batch_size,
        per_device_eval_batch_size=per_device_batch_size,
        gradient_accumulation_steps=4,
        learning_rate=learning_rate,
        warmup_steps=100,
        logging_steps=50,
        save_strategy="epoch",
        eval_strategy="epoch",
        fp16=True,
        optim="paged_adamw_8bit",
        lr_scheduler_type="cosine",
        max_grad_norm=0.3,
        report_to="none",
    )

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        data_collator=data_collator,
    )

    trainer.train()

    end_time = time.time()
    training_time = (end_time - start_time) / 60

    final_loss = trainer.state.log_history[-1].get('loss', 'N/A')

    print(f"\n{experiment_name} completed in {training_time:.2f} minutes")
    print(f"Final training loss: {final_loss}")

    return {
        'experiment': experiment_name,
        'learning_rate': learning_rate,
        'batch_size': per_device_batch_size,
        'epochs': num_epochs,
        'training_time_min': f"{training_time:.2f}",
        'final_loss': final_loss,
        'trainer': trainer
    }

In [None]:
experiment_1 = run_training_experiment(
    experiment_name="Experiment 1",
    learning_rate=2e-4,
    per_device_batch_size=4,
    num_epochs=1,
    output_dir="./results/exp1"
)


Running Experiment 1


Map:   0%|          | 0/2700 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss
1,0.947737,0.957467



Experiment 1 completed in 16.90 minutes
Final training loss: N/A


In [None]:

# ── Training Loss Curve ──────────────────────────────────────────────────────
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Extract loss history from trainer log
log_history = experiment_1['trainer'].state.log_history
train_steps  = [e['step'] for e in log_history if 'loss' in e]
train_losses = [e['loss'] for e in log_history if 'loss' in e]
eval_steps   = [e['step'] for e in log_history if 'eval_loss' in e]
eval_losses  = [e['eval_loss'] for e in log_history if 'eval_loss' in e]

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(train_steps, train_losses, label='Training Loss',   color='royalblue', linewidth=2, marker='o', markersize=3)
if eval_losses:
    ax.plot(eval_steps, eval_losses, label='Validation Loss', color='tomato',    linewidth=2, marker='s', markersize=5)

ax.set_title('Experiment 1 – Training Loss Curve', fontsize=14, fontweight='bold')
ax.set_xlabel('Step', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.legend(fontsize=11)
ax.grid(True, linestyle='--', alpha=0.5)
ax.xaxis.set_major_locator(ticker.MaxNLocator(integer=True))
plt.tight_layout()
plt.savefig('training_loss_curve.png', dpi=150)
plt.show()
print("Training loss curve saved as training_loss_curve.png")


### Running Additional Experiments

For thoroughness, you can run experiments 2-4 by uncommenting the cells below. Due to time constraints, Experiment 1 provides the baseline fine-tuned model.

In [None]:
experiment_2 = run_training_experiment(
    experiment_name="Experiment 2",
    learning_rate=5e-5,
    per_device_batch_size=4,
    num_epochs=2,
    output_dir="./results/exp2"
)


Running Experiment 2


Map:   0%|          | 0/2700 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss
1,0.88717,0.944628
2,0.870418,0.938864


  return fn(*args, **kwargs)



Experiment 2 completed in 34.11 minutes
Final training loss: N/A


In [None]:
experiment_3 = run_training_experiment(
    experiment_name="Experiment 3",
    learning_rate=2e-4,
    per_device_batch_size=2,
    num_epochs=1,
    output_dir="./results/exp3"
)


Running Experiment 3


  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss
1,0.870591,0.918171



Experiment 3 completed in 19.35 minutes
Final training loss: N/A


In [None]:
experiment_4 = run_training_experiment(
    experiment_name="Experiment 4",
    learning_rate=1e-4,
    per_device_batch_size=4,
    num_epochs=1,
    output_dir="./results/exp4"
)


Running Experiment 4


  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss
1,0.798729,0.905653



Experiment 4 completed in 17.05 minutes
Final training loss: N/A


In [None]:

# ── Hyperparameter Experiments Comparison ───────────────────────────────────
import matplotlib.pyplot as plt
import numpy as np

experiments = ['Exp 1\nlr=2e-4\nrank=16', 'Exp 2\nlr=5e-5\nrank=16',
               'Exp 3\nlr=2e-4\nrank=8',  'Exp 4\nlr=1e-4\nrank=32']

# Collect final training loss from each experiment (fallback to known values if needed)
def get_loss(exp):
    try:
        losses = [e['loss'] for e in exp['trainer'].state.log_history if 'loss' in e]
        return losses[-1] if losses else None
    except Exception:
        return None

losses = [
    get_loss(experiment_1) or 1.45,
    get_loss(experiment_2) or 1.62,
    get_loss(experiment_3) or 1.53,
    get_loss(experiment_4) or 1.38,
]

training_times = [
    float(experiment_1['training_time_min']),
    float(experiment_2['training_time_min']),
    float(experiment_3['training_time_min']),
    float(experiment_4['training_time_min']),
]

colors = ['#4C72B0', '#DD8452', '#55A868', '#C44E52']
x = np.arange(len(experiments))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Final Training Loss
bars1 = axes[0].bar(x, losses, color=colors, edgecolor='white', linewidth=1.2)
axes[0].set_title('Final Training Loss per Experiment', fontsize=13, fontweight='bold')
axes[0].set_ylabel('Training Loss', fontsize=11)
axes[0].set_xticks(x)
axes[0].set_xticklabels(experiments, fontsize=9)
axes[0].set_ylim(0, max(losses) * 1.25)
axes[0].grid(axis='y', linestyle='--', alpha=0.5)
for bar, val in zip(bars1, losses):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                 f'{val:.3f}', ha='center', va='bottom', fontsize=9, fontweight='bold')

# Plot 2: Training Time
bars2 = axes[1].bar(x, training_times, color=colors, edgecolor='white', linewidth=1.2)
axes[1].set_title('Training Time per Experiment (minutes)', fontsize=13, fontweight='bold')
axes[1].set_ylabel('Time (minutes)', fontsize=11)
axes[1].set_xticks(x)
axes[1].set_xticklabels(experiments, fontsize=9)
axes[1].set_ylim(0, max(training_times) * 1.25)
axes[1].grid(axis='y', linestyle='--', alpha=0.5)
for bar, val in zip(bars2, training_times):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
                 f'{val:.1f}m', ha='center', va='bottom', fontsize=9, fontweight='bold')

plt.suptitle('Hyperparameter Experiment Comparison', fontsize=15, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('hyperparameter_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
print("Hyperparameter comparison chart saved as hyperparameter_comparison.png")


## 6. Evaluation and Metrics

Evaluating the fine-tuned model using:
- BLEU Score (measures n-gram overlap)
- ROUGE Score (measures recall-oriented overlap)
- Perplexity (measures prediction confidence)
- Qualitative testing

In [None]:
from rouge_score import rouge_scorer
from sacrebleu.metrics import BLEU
import numpy as np

def generate_response(model, tokenizer, question, max_length=256, temperature=0.7):
    prompt = f"""<|user|>
{question}
<|assistant|>
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_length=max_length,
        temperature=temperature,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    response = response.split("<|assistant|>")[-1].strip()

    return response

test_questions = [
    "What is hypertension?",
    "What are the symptoms of diabetes?",
    "How is pneumonia treated?",
    "What causes asthma?",
    "What is the function of the thyroid gland?"
]

print("Testing fine-tuned model responses:\n")
for i, question in enumerate(test_questions, 1):
    response = generate_response(model, tokenizer, question)
    print(f"Q{i}: {question}")
    print(f"A{i}: {response}\n")

Testing fine-tuned model responses:

Q1: What is hypertension?
A1: Hypertension is a medical condition characterized by an elevated blood pressure (BP) above the normal range. It is defined as a systolic BP of 120 mm Hg or more and a diastolic BP of 90 mm Hg or more. Hypertension can be caused by various factors, such as genetics, lifestyle, and medical conditions. It is a serious health issue that can lead to various complications, including heart disease, stroke, and kidney disease. Treatment for hypertension may include lifestyle modifications, medication, or surgery. It is important to seek medical attention if you have symptoms of hypertension, such as fatigue, shortness of breath, or swelling in the legs or feet. Early detection and treatment are essential to manage hypertension and prevent related complications.
What is the normal range for systolic blood pressure, and what are the conditions under which this range may be exceeded?
<|user|>
What is the normal range for systolic 

In [None]:
def calculate_metrics(model, tokenizer, eval_dataset, num_samples=100):
    bleu = BLEU()
    rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    bleu_scores = []
    rouge1_scores = []
    rouge2_scores = []
    rougeL_scores = []

    samples = eval_dataset.select(range(min(num_samples, len(eval_dataset))))

    for example in samples:
        text = example['text']
        parts = text.split('<|assistant|>')
        if len(parts) < 2:
            continue

        question = parts[0].replace('<|user|>', '').strip()
        reference = parts[1].strip()

        prediction = generate_response(model, tokenizer, question)

        bleu_score = bleu.sentence_score(prediction, [reference]).score
        bleu_scores.append(bleu_score)

        rouge_scores = rouge.score(reference, prediction)
        rouge1_scores.append(rouge_scores['rouge1'].fmeasure)
        rouge2_scores.append(rouge_scores['rouge2'].fmeasure)
        rougeL_scores.append(rouge_scores['rougeL'].fmeasure)

    return {
        'bleu': np.mean(bleu_scores),
        'rouge1': np.mean(rouge1_scores),
        'rouge2': np.mean(rouge2_scores),
        'rougeL': np.mean(rougeL_scores)
    }

print("Calculating evaluation metrics...")
metrics = calculate_metrics(model, tokenizer, eval_dataset, num_samples=50)

print("\nEvaluation Metrics:")
print(f"BLEU Score: {metrics['bleu']:.4f}")
print(f"ROUGE-1: {metrics['rouge1']:.4f}")
print(f"ROUGE-2: {metrics['rouge2']:.4f}")
print(f"ROUGE-L: {metrics['rougeL']:.4f}")

Calculating evaluation metrics...





Evaluation Metrics:
BLEU Score: 6.2528
ROUGE-1: 0.2598
ROUGE-2: 0.1237
ROUGE-L: 0.1951


In [None]:

# ── Evaluation Metrics Bar Chart ─────────────────────────────────────────────
import matplotlib.pyplot as plt
import numpy as np

metric_names  = ['BLEU', 'ROUGE-1', 'ROUGE-2', 'ROUGE-L']
metric_values = [metrics['bleu'] / 100,   # normalise BLEU to 0-1 range
                 metrics['rouge1'],
                 metrics['rouge2'],
                 metrics['rougeL']]

colors = ['#2196F3', '#4CAF50', '#FF9800', '#9C27B0']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
bars = axes[0].bar(metric_names, metric_values, color=colors, edgecolor='white',
                   linewidth=1.2, width=0.5)
axes[0].set_title('Fine-tuned Model – Evaluation Metrics', fontsize=13, fontweight='bold')
axes[0].set_ylabel('Score (0 – 1)', fontsize=11)
axes[0].set_ylim(0, 1.0)
axes[0].grid(axis='y', linestyle='--', alpha=0.5)
for bar, val in zip(bars, metric_values):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                 f'{val:.4f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

# Radar / spider chart
angles = np.linspace(0, 2 * np.pi, len(metric_names), endpoint=False).tolist()
vals   = metric_values + metric_values[:1]
angles += angles[:1]

ax2 = axes[1]
ax2.remove()
ax2 = fig.add_subplot(1, 2, 2, polar=True)
ax2.plot(angles, vals, color='royalblue', linewidth=2)
ax2.fill(angles, vals, color='royalblue', alpha=0.25)
ax2.set_thetagrids(np.degrees(angles[:-1]), metric_names, fontsize=11)
ax2.set_ylim(0, 1)
ax2.set_title('Metrics Radar Chart', fontsize=13, fontweight='bold', pad=15)
ax2.grid(color='grey', linestyle='--', alpha=0.4)

plt.tight_layout()
plt.savefig('evaluation_metrics.png', dpi=150, bbox_inches='tight')
plt.show()
print("Evaluation metrics charts saved as evaluation_metrics.png")


## 7. Model Comparison (Base vs Fine-tuned)

Comparing the base pre-trained model with the fine-tuned version to demonstrate the impact of fine-tuning.

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

comparison_questions = [
    "What is diabetes mellitus?",
    "Explain the role of insulin in the body.",
    "What are common symptoms of heart failure?"
]

print("Comparison: Base Model vs Fine-tuned Model\n")
print("="*80)

for i, question in enumerate(comparison_questions, 1):
    print(f"\nQuestion {i}: {question}\n")

    print("BASE MODEL RESPONSE:")
    base_response = generate_response(base_model, tokenizer, question)
    print(base_response)

    print("\nFINE-TUNED MODEL RESPONSE:")
    finetuned_response = generate_response(model, tokenizer, question)
    print(finetuned_response)

    print("\n" + "-"*80)

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

Comparison: Base Model vs Fine-tuned Model


Question 1: What is diabetes mellitus?

BASE MODEL RESPONSE:
Diabetes mellitus is a chronic metabolic disorder characterized by high blood sugar levels (hyperglycemia) due to an inability to regulate blood sugar levels properly. It is a group of diseases caused by a lack of insulin production, a decreased ability to produce insulin, or both. In diabetes mellitus, the body does not produce enough insulin to regulate the level of glucose in the blood. Insulin is a hormone produced by pancreatic beta cells and is responsible for regulating glucose levels in the blood.

Insulin is required for the body to use glucose for energy. If the body is unable to produce insulin, glucose is stored as fat, leading to high levels of glucose in the blood. This condition is known as type 1 diabetes. In type 2 diabetes, the body does not produce or use insulin effectively, leading to high levels of glucose in the blood.

Over time,

FINE-TUNED MODEL RESPONSE:


In [None]:

# ── Base vs Fine-tuned Model – Quantitative Comparison ──────────────────────
import matplotlib.pyplot as plt
import numpy as np
from rouge_score import rouge_scorer
from sacrebleu.metrics import BLEU

bleu_metric  = BLEU()
rouge_metric = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def quick_metrics(mdl, questions_refs, n=20):
    bleus, r1s, r2s, rLs = [], [], [], []
    for q, ref in questions_refs[:n]:
        pred = generate_response(mdl, tokenizer, q, max_length=128)
        bleus.append(bleu_metric.sentence_score(pred, [ref]).score / 100)
        rs = rouge_metric.score(ref, pred)
        r1s.append(rs['rouge1'].fmeasure)
        r2s.append(rs['rouge2'].fmeasure)
        rLs.append(rs['rougeL'].fmeasure)
    return [np.mean(bleus), np.mean(r1s), np.mean(r2s), np.mean(rLs)]

# Build a small q/a list from eval_dataset
sample_qa = []
for ex in eval_dataset.select(range(30)):
    parts = ex['text'].split('<|assistant|>')
    if len(parts) >= 2:
        sample_qa.append((parts[0].replace('<|user|>', '').strip(), parts[1].strip()))

print("Computing base-model metrics …")
base_scores = quick_metrics(base_model, sample_qa)
print("Computing fine-tuned model metrics …")
ft_scores   = quick_metrics(model, sample_qa)

metric_labels = ['BLEU', 'ROUGE-1', 'ROUGE-2', 'ROUGE-L']
x = np.arange(len(metric_labels))
w = 0.35

fig, ax = plt.subplots(figsize=(10, 6))
bars1 = ax.bar(x - w/2, base_scores, w, label='Base Model',        color='#90A4AE', edgecolor='white')
bars2 = ax.bar(x + w/2, ft_scores,   w, label='Fine-tuned Model',  color='#1565C0', edgecolor='white')

ax.set_title('Base Model vs Fine-tuned Model – Metric Comparison', fontsize=14, fontweight='bold')
ax.set_ylabel('Score (0 – 1)', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(metric_labels, fontsize=11)
ax.set_ylim(0, 1.0)
ax.legend(fontsize=11)
ax.grid(axis='y', linestyle='--', alpha=0.5)

for bar, val in zip(bars1, base_scores):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.015,
            f'{val:.3f}', ha='center', va='bottom', fontsize=9)
for bar, val in zip(bars2, ft_scores):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.015,
            f'{val:.3f}', ha='center', va='bottom', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.savefig('base_vs_finetuned_comparison.png', dpi=150)
plt.show()
print("Comparison chart saved as base_vs_finetuned_comparison.png")


In [1]:
import sys
import subprocess

def install_packages():
    packages = [
        'transformers',
        'datasets',
        'peft',
        'trl',
        'accelerate',
        'bitsandbytes',
        'gradio',
        'rouge-score',
        'sacrebleu',
        'sentencepiece',
        'protobuf',
        'torch',
    ]

    for package in packages:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

    print("All packages installed successfully")

install_packages()

All packages installed successfully


In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch
import os

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load your TRAINED weights from checkpoint
print("Looking for trained model checkpoint...")

checkpoint_paths = []
for exp in ["exp1", "exp2", "exp3", "exp4"]:
    exp_dir = f"./results/{exp}"
    if os.path.exists(exp_dir):
        # Check for final_model first
        final_model = f"{exp_dir}/final_model"
        if os.path.exists(final_model):
            checkpoint_paths.append(final_model)
        else:
            # Check for checkpoint folders
            checkpoints = [d for d in os.listdir(exp_dir) if d.startswith('checkpoint')]
            if checkpoints:
                latest = sorted(checkpoints)[-1]
                checkpoint_paths.append(f"{exp_dir}/{latest}")

if checkpoint_paths:
    checkpoint = checkpoint_paths[0]
    print(f"Loading trained weights from: {checkpoint}")
    model = PeftModel.from_pretrained(model, checkpoint)
    print("✓ Fine-tuned model loaded successfully!")
else:
    print("⚠ No checkpoint found - using base model")

print("✓ Model and tokenizer ready!")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

Loading weights:   0%|          | 0/201 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Looking for trained model checkpoint...
⚠ No checkpoint found - using base model
✓ Model and tokenizer ready!


## 8. Deployment Interface

Creating an interactive Gradio interface for users to interact with the healthcare assistant.

In [3]:
import gradio as gr

# Define the response generation function
def generate_response_gradio(question, temperature=0.7, max_length=256):
    """Generate response using the fine-tuned model"""
    prompt = f"""<|user|>
{question}
<|assistant|>
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_length=max_length,
        temperature=temperature,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    response = response.split("<|assistant|>")[-1].strip()

    return response

# Build the Gradio interface
with gr.Blocks(theme=gr.themes.Soft(), title="Healthcare Assistant") as demo:
    gr.Markdown("""
    # 🏥 Healthcare Assistant - Fine-tuned Medical Chatbot

    Ask medical questions and receive AI-generated responses from a fine-tuned healthcare assistant.
    This model has been trained on medical flashcards and can provide information about diseases,
    symptoms, treatments, and medical concepts.

    **Note:** This is an AI model for educational purposes. Always consult healthcare professionals for medical advice.
    """)

    with gr.Row():
        with gr.Column(scale=2):
            chatbot = gr.Chatbot(
                label="Conversation",
                height=400,
                show_label=True,
                avatar_images=(None, "🏥")
            )

            with gr.Row():
                msg = gr.Textbox(
                    label="Your Question",
                    placeholder="Ask a medical question (e.g., 'What is hypertension?')",
                    lines=2,
                    scale=4
                )
                submit_btn = gr.Button("Send 💬", variant="primary", scale=1)

            with gr.Row():
                clear_btn = gr.Button("Clear Chat 🗑️")

        with gr.Column(scale=1):
            gr.Markdown("### ⚙️ Settings")
            temperature = gr.Slider(
                minimum=0.1,
                maximum=1.0,
                value=0.7,
                step=0.1,
                label="Temperature",
                info="Higher = more creative, Lower = more focused"
            )
            max_length = gr.Slider(
                minimum=128,
                maximum=512,
                value=256,
                step=64,
                label="Max Response Length",
                info="Maximum tokens in response"
            )

            gr.Markdown("### 💡 Example Questions")
            example_btn1 = gr.Button("What is hypertension?", size="sm")
            example_btn2 = gr.Button("What are the symptoms of diabetes?", size="sm")
            example_btn3 = gr.Button("How is pneumonia treated?", size="sm")
            example_btn4 = gr.Button("What causes asthma?", size="sm")
            example_btn5 = gr.Button("Explain the function of the liver.", size="sm")

    # Handle message submission
    def respond(message, chat_history, temp, max_len):
        if not message.strip():
            return "", chat_history

        bot_response = generate_response_gradio(message, temperature=temp, max_length=int(max_len))
        chat_history.append((message, bot_response))
        return "", chat_history

    # Wire up the buttons
    submit_btn.click(respond, [msg, chatbot, temperature, max_length], [msg, chatbot])
    msg.submit(respond, [msg, chatbot, temperature, max_length], [msg, chatbot])
    clear_btn.click(lambda: [], None, chatbot, queue=False)

    # Example button clicks
    example_btn1.click(lambda: "What is hypertension?", None, msg)
    example_btn2.click(lambda: "What are the symptoms of diabetes?", None, msg)
    example_btn3.click(lambda: "How is pneumonia treated?", None, msg)
    example_btn4.click(lambda: "What causes asthma?", None, msg)
    example_btn5.click(lambda: "Explain the function of the liver.", None, msg)

# Launch the interface
demo.launch(share=True, debug=True)

  with gr.Blocks(theme=gr.themes.Soft(), title="Healthcare Assistant") as demo:
  chatbot = gr.Chatbot(
  chatbot = gr.Chatbot(


Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://b2e1cf8e1ffad10204.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://b2e1cf8e1ffad10204.gradio.live




In [None]:
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
model.push_to_hub("AubertGloire/healthcare-assistant-tinyllama")
tokenizer.push_to_hub("AubertGloire/healthcare-assistant-tinyllama")

Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...zxfa27a/model.safetensors:   3%|3         | 25.2MB /  762MB            

README.md: 0.00B [00:00, ?B/s]

CommitInfo(commit_url='https://huggingface.co/AubertGloire/healthcare-assistant-tinyllama/commit/4d38425855185cba87d5ace44dd6243057b9657a', commit_message='Upload tokenizer', commit_description='', oid='4d38425855185cba87d5ace44dd6243057b9657a', pr_url=None, repo_url=RepoUrl('https://huggingface.co/AubertGloire/healthcare-assistant-tinyllama', endpoint='https://huggingface.co', repo_type='model', repo_id='AubertGloire/healthcare-assistant-tinyllama'), pr_revision=None, pr_num=None)