<a href="https://colab.research.google.com/github/alex-smith-uwec/NLP_Spring2025/blob/main/Starter_Medical_Questions_Assignment_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Assignment: Fine-Tuning a Transformer on Medical Question Pairs

In this assignment, you will fine-tune a transformer model to classify whether pairs of medical questions are paraphrases of each other.

Once you have everything in place, and before you start training, restart the runtime and change its type to TPU.

In [None]:
# Install necessary libraries
!pip install transformers datasets -q

In [None]:
# TODO: Set the random seed to your Blugold ID
seed = ##
# Enter your name here:

## Step 1: Load the Dataset
Use the `datasets` library to load the [curaihealth medical_questions_pairs](https://huggingface.co/datasets/curaihealth/medical_questions_pairs) dataset.

In [None]:
from datasets import load_dataset
# TODO: Load the dataset

dataset = ##
dataset

## Train/Validation/Test Split
The `medical_questions_pairs` dataset provides only a single `train` split, so you need to create your own train, validation, and test sets.

We'll split the dataset into:
- **Train:** 80%
- **Validation:** 10%
- **Test:** 10%

Use the `train_test_split` method of the `Dataset` object to do this, passing `seed` so that the split is reproducible.

In [None]:
from datasets import DatasetDict

# Step 1: Split into train + temp (val + test)
temp_split = dataset['train'].train_test_split(test_size=0.2, seed=seed)

# Step 2: Split temp into validation + test (50/50 of temp = 10% each)
val_test_split = temp_split['test'].train_test_split(test_size=0.5, seed=seed)

# Step 3: Combine splits into a DatasetDict
split_dataset = DatasetDict({
    'train': temp_split['train'],
    'validation': val_test_split['train'],
    'test': val_test_split['test']
})

split_dataset
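
The two-stage split above can be sanity-checked without the library. The sketch below is a library-free illustration, not how `datasets` implements `train_test_split`: it shuffles a list of row indices with a hypothetical helper, `two_stage_split`, holds out 20%, and then halves that holdout. The row count of 1000 and the seed of 0 are made-up values.

```python
import random

def two_stage_split(n_rows, seed):
    """Split row indices 80/10/10 by holding out 20%, then halving the holdout."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility

    n_holdout = n_rows // 5               # 20% held out for validation + test
    holdout, train = indices[:n_holdout], indices[n_holdout:]

    n_val = len(holdout) // 2             # half of the holdout = 10% of all rows
    validation, test = holdout[:n_val], holdout[n_val:]
    return train, validation, test

train, validation, test = two_stage_split(n_rows=1000, seed=0)
print(len(train), len(validation), len(test))  # 800 100 100
```

Reusing the same seed reproduces the same three index lists, which is the reason the notebook passes `seed` to both `train_test_split` calls.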

In [None]:
# TODO: Find an index whose validation question pair has label 0
idx_0=##
split_dataset['validation'][idx_0]


In [None]:
# TODO: Find an index whose validation question pair has label 1
idx_1=##
split_dataset['validation'][idx_1]

## Step 2: Explore and Preprocess
Examine the fields. Tokenize question pairs using a pretrained tokenizer.

In [None]:
from transformers import AutoTokenizer

# Choose a model checkpoint
checkpoint = 'microsoft/MiniLM-L12-H384-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize both questions together as a single paired input,
# truncating or padding every pair to exactly 256 tokens
def tokenize_fn(example):
    return tokenizer(
        example['question_1'],
        example['question_2'],
        truncation=True,
        padding='max_length',
        max_length=256,
    )

# Apply to every split in batches
tokenized = split_dataset.map(tokenize_fn, batched=True)
tokenized
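
Passing two texts to the tokenizer produces one packed sequence per pair. The sketch below imitates a BERT-style layout (`[CLS] question_1 [SEP] question_2 [SEP]`, then padding) on whitespace-split words rather than real subword IDs, assuming the chosen checkpoint uses a BERT-style tokenizer. The helper `pack_pair` and the tiny `max_length` are hypothetical, and truncation is left out; compare it against the real tokenizer output printed in the next cell.

```python
def pack_pair(question_1, question_2, max_length):
    """Toy BERT-style pair packing: special tokens, segment IDs, padding."""
    first = ["[CLS]"] + question_1.split() + ["[SEP]"]
    second = question_2.split() + ["[SEP]"]
    tokens = first + second
    if len(tokens) > max_length:
        raise ValueError("pair does not fit; a real tokenizer would truncate")

    n_pad = max_length - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * n_pad,
        # 0 marks the first segment, 1 the second; padding gets 0
        "token_type_ids": [0] * len(first) + [1] * len(second) + [0] * n_pad,
        # 1 for real tokens, 0 for padding the model should ignore
        "attention_mask": [1] * len(tokens) + [0] * n_pad,
    }

packed = pack_pair("is a fever serious", "should i worry about fever", max_length=16)
print(packed["tokens"])
print(packed["attention_mask"])
```

Padding every pair to the same length keeps tensor shapes fixed across batches, at the cost of extra computation on short questions.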

In [None]:
print(tokenized['train'][0])

## Step 3: Load Model
Load a model for sequence classification.

In [None]:
from transformers import AutoModelForSequenceClassification

# TODO: Define the model with the correct number of labels
model = ##

## Step 4: Define Training Arguments
Use Hugging Face `TrainingArguments` to configure training.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    logging_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    report_to="none"
)
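
To get a feel for what these settings imply, the number of optimizer steps can be estimated by hand. This is a rough sketch that assumes a single device, no gradient accumulation, and a hypothetical training-set size of 2,400 rows; in the notebook, substitute `len(tokenized['train'])`. The helper name `estimate_training_steps` is also made up for illustration.

```python
import math

def estimate_training_steps(n_train_rows, batch_size, num_epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(n_train_rows / batch_size)  # last batch may be partial
    return steps_per_epoch * num_epochs

# 2400 rows is a made-up figure; batch size and epochs match the arguments above
print(estimate_training_steps(n_train_rows=2400, batch_size=16, num_epochs=3))  # 450
```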

## Step 5: Define Trainer
Set up the `Trainer` object and begin training.

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc}
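
To see what `compute_metrics` does to the raw model outputs, the same argmax-then-compare logic can be traced on a few made-up logits. The sketch uses plain Python lists instead of NumPy and scikit-learn, and the helper name `accuracy_from_logits` is hypothetical.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    preds = [row.index(max(row)) for row in logits]  # argmax over the class axis
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Four made-up examples with two class scores each (index 0 = label 0, index 1 = label 1)
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```

The third example is the one counted as wrong: its larger score sits at index 1 while its label is 0.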

In [None]:
from transformers import Trainer

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:

# Start training
trainer.train()

## Step 6: Evaluation
Evaluate the model on the validation set (the `eval_dataset` passed to the `Trainer`) and inspect the results.

In [None]:
# Evaluate the model
metrics = trainer.evaluate()
print(metrics)

## Training Accuracy
Now that training is complete, let's evaluate the model on the training set to report training accuracy.

In [None]:
# Evaluate on training data
train_metrics = trainer.evaluate(tokenized["train"])
print(f"Training Accuracy: {train_metrics['eval_accuracy']:.4f}")