<a href="https://colab.research.google.com/github/dlvh/biods271/blob/main/BIODS271_2025_Homework_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 1: Evaluating the robustness of LLMs for medical question answering

### Overview

Large language models (LLMs) have shown impressive performance in answering medical questions, but their reliability can be inconsistent, especially when users phrase questions in non-standard ways. In real-world settings, patients and clinicians may introduce typos, grammatical errors, or stylistic variations in their queries, potentially leading to incorrect responses. In your first homework assignment, you will evaluate how robust an open-source LLM (FLAN-T5) is when faced with systematically perturbed medical questions from the MedQA dataset.

### Your task

You will develop an automated pipeline to modify MedQA questions while preserving their original meaning. Your goal is to identify systematic weaknesses in the model's responses by applying perturbations and evaluating their impact on the model's accuracy. Some examples of perturbations to consider: typos, paraphrasing, patient characteristics (e.g. ethnicity, gender, age).

❗Make sure to click File > Save a copy in Drive before you get started on this assignment. If you edit this notebook directly, your changes will not be saved.

### Install Python Packages

In [1]:
# Install and import necessary dependencies
!pip install datasets

from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration
from typing import List, Dict

from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)


In [2]:
import torch
import re
import random

### Load the MedQA dataset

We will be loading the MedQA dataset from [HuggingFace](https://huggingface.co/datasets/bigbio/med_qa). This dataset consists of multiple-choice question/answer pairs related to solving medical problems. The questions have been collected from professional medical board exams, and in total we have 12723 questions split into train, validation, and test sets. Details on how this dataset was created can be found in the paper "[What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams](https://arxiv.org/abs/2009.13081)."

In [3]:
medqa = load_dataset("bigbio/med_qa", "med_qa_en_source")


Let's look at an example question in the dataset! Feel free to play around with the index and split to view more examples.

In [4]:
ind = 0
split = 'train'
example = medqa[split][ind]

print(f"Question: {example['question']}")
print("\nOptions:")
for i, option in enumerate(example['options']):
  print(f"  {option['key']}. {option['value']}")

print(f"\nCorrect Answer: {example['answer_idx']}. {example['answer']}")

Question: A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is the best treatment for this patient?

Options:
  A. Ampicillin
  B. Ceftriaxone
  C. Ciprofloxacin
  D. Doxycycline
  E. Nitrofurantoin

Correct Answer: E. Nitrofurantoin


### Load the model
In this homework, we will be using the publicly available FLAN-T5 model, which was released in the paper "[Scaling Instruction-Finetuned Language Models](https://arxiv.org/pdf/2210.11416)." This LLM is open-source and fine-tuned on over 1,800 tasks. We can directly load the model weights from [HuggingFace](https://huggingface.co/docs/transformers/en/model_doc/flan-t5) without fine-tuning the model.

You are allowed to change the model. If you do so, please explain in your write-up what model you used and why you made this choice.

In [5]:
model_name = "google/flan-t5-large"

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)


### Let's run the model on an example question from the dataset and evaluate its performance (5 points)

In [6]:
# Function to generate an answer for a specific question
def get_model_answer(model, question: str, options: List) -> Dict:
  """
  Generate an answer for a question given the potential answer options

  Params:
    model: FLAN T5 model
    question (str): A question from the dataset
    options (list): A list of dictionaries of the possible answer options.
                    Each dictionary maps the answer index to the answer string.

  Returns:
    generated_answer (dict): A dictionary that contains the model's predicted answer to the question,
                             e.g. {'key': 'A', 'value': 'Disease X'}, or None if no option could be matched.
  """

  # TODO: your code here

  # Prompting
  option_str = " ".join([f"({opt['key']}) {opt['value']}" for opt in options])
  prompt = f"Question: {question}\nOptions: {option_str}\nAnswer:"
  # Encode the Prompt
  enc = tokenizer(prompt, return_tensors="pt")
  # Call Model
  out_ids = model.generate(**enc, max_new_tokens=16)
  # Decode the Output
  pred_text = tokenizer.decode(out_ids[0], skip_special_tokens=True).strip()
  ## DEBUGGING
  ##print(f"Prompt: {prompt}")
  ##print(f"Prediction: {pred_text}")
  # Search for Answer
  m = re.search(r"\b([A-E])\b", pred_text, flags=re.I)
  # If the model generates a letter option answer
  if m:
    pred_key = m.group(1).upper()
    for opt in options:
      if opt["key"] == pred_key:
        return opt
  # Otherwise (no letter found, or the letter is not a valid option key),
  # fall back to matching the option text and return the corresponding option
  pred_lower = pred_text.lower()
  for opt in options:
    if opt["value"].lower() in pred_lower:
      return opt
  # If no valid model answer, return None
  return None
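
The answer-parsing step can be exercised on its own, without loading the model. The following is a minimal sketch rather than the function used above: `parse_prediction` is a hypothetical helper name, and it assumes the generated text is either the option text or a standalone upper-case option letter (possibly in parentheses). It checks the option text first and only then looks for a letter that is a valid key.

```python
import re
from typing import Dict, List, Optional


def parse_prediction(pred_text: str, options: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Map raw generated text to one of the answer options (hypothetical helper)."""
    by_key = {opt["key"]: opt for opt in options}

    # 1) Prefer a match on the option text, which is less ambiguous than a single letter.
    pred_lower = pred_text.lower()
    for opt in options:
        if opt["value"].lower() in pred_lower:
            return opt

    # 2) Otherwise accept the first standalone upper-case letter that is a valid option key.
    for letter in re.findall(r"\b([A-Z])\b", pred_text):
        if letter in by_key:
            return by_key[letter]

    # 3) Nothing matched.
    return None


# Toy options for illustration only.
options = [
    {"key": "A", "value": "Ampicillin"},
    {"key": "B", "value": "Ceftriaxone"},
]
print(parse_prediction("(B)", options))
print(parse_prediction("ceftriaxone", options))
print(parse_prediction("I am not sure", options))
```

Restricting the letter search to upper case is a design choice of this sketch: it avoids reading the article "a" as option A, at the cost of missing a lower-case answer such as "b".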

In [7]:
def evaluate_model(generated_answer: str, gt_answer: str) -> bool:
  """
  Determine if the generated answer is the same as the ground truth answer

  Params:
    generated_answer (str): Model's generated answer index (e.g 'A')
    gt_answer (str): Ground truth answer index (e.g 'A')

  Returns:
    bool: True if answers are the same, False otherwise
  """

  # TODO: your code here
  return generated_answer.strip().upper() == gt_answer.strip().upper()

In [8]:
# Once you fill in the above functions, run this cell

ind = 0
split = 'train'
example = medqa[split][ind]

question = example['question']
options = example['options']
answer = example['answer_idx']

# Given an example from the dataset, generate an answer using the model and then evaluate its correctness
generated_answer = get_model_answer(model, question, options)
correct_bool = evaluate_model(generated_answer['key'], answer)

print("Original question:", question)
print(f"\nModel's answer: {generated_answer['key']} ({generated_answer['value']})")
print("\nCorrect answer:", answer)
print("\nIs the model correct?", correct_bool)

Original question: A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is the best treatment for this patient?

Model's answer: B (Ceftriaxone)

Correct answer: E

Is the model correct? False


In [9]:
example

{'meta_info': 'step2&3',
 'question': 'A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is the best treatment for this patient?',
 'answer_idx': 'E',
 'answer': 'Nitrofurantoin',
 'options': [{'key': 'A', 'value': 'Ampicillin'},
  {'key': 'B', 'value': 'Ceftriaxone'},
  {'key': 'C', 'value': 'Ciprofloxacin'},
  {'key': 'D', 'value': 'Doxycycline'},
  {'key': 'E', 'value': 'Nitrofurantoin'}]}

### Let's perturb the inputs! (25 points)
Now that you know how to run the model to generate an answer to a question and evaluate if that answer is correct, let's play around with the input questions. Define some perturbations, and write code that automatically applies those perturbations. Remember that we do not want to change the medical content in the questions; instead, we want to simulate how a real user might write their queries differently. Come up with **at least 3** different perturbations, and then systematically run those queries through the model (we recommend using a fixed set of 20-50 queries that you alter). Report the accuracy of the model on the original queries and in each of the perturbed cases.

You can write functions to introduce perturbations (like typos), or you can also use an LLM. Feel free to format your code however you like. You might want to consider modifying the above functions (or creating new functions), so that we can take in a series of questions, rather than one at a time.

For full points, we expect the following from your code:


1.   At least three clearly defined methods that automatically apply perturbations to the questions
2.   Evaluation metrics detailing how well the model performs on the original questions compared to the perturbed questions
3. We encourage creativity with your perturbations! If all your perturbations are very naive (e.g. you introduce a typo character for all 3 of your perturbations), we will not be able to give you full points.

In [10]:
# TODO: define your perturbations, run the perturbed queries through the model, and evaluate the model's performance.

**Approach: Helper Function and Original Questions**

First, define a function that evaluates model performance on a list of "examples", where each example is a dictionary containing the question, the options, and the ground-truth answer index (the option letter). The function returns the accuracy over this list of examples.

Using this function, we can first evaluate the accuracy of the model on the original questions.

In [11]:
def evaluate_model_on_examples(
        model,
        examples: List[Dict],
    ) -> float:
    """
    Args
    ----
    model      : the FLAN‑T5 model in this assignment
    examples   : list where each item is a dict including the following keys:
                 {"question": str,
                  "options" : List[{"key": str, "value": str}],
                  "answer_idx": str}

    Returns
    -------
    accuracy (float in [0,1])
    """
    n_correct = 0
    for ex in examples:
        pred = get_model_answer(model, ex["question"], ex["options"])
        if pred is None:
            continue
        if evaluate_model(pred["key"], ex["answer_idx"]):
            n_correct += 1
    return n_correct / len(examples)

In [19]:
random.seed(217)               # fix seed for reproducibility
num_samples = 40               # takes about 5 minutes on CPU

train_set = medqa["train"]
all_indices     = random.sample(range(len(train_set)), num_samples)
sample_examples = [train_set[i] for i in all_indices]

acc = evaluate_model_on_examples(model, sample_examples)

print("Randomly chosen indices:", all_indices)
print(f"Accuracy on these {num_samples} examples, original questions: {acc:.2%}")

Randomly chosen indices: [4696, 3860, 3022, 4122, 4092, 1990, 5586, 2452, 4403, 5308, 6947, 9008, 9565, 8964, 3464, 1526, 1344, 4711, 1316, 783, 9181, 5780, 1160, 4153, 5240, 4746, 7087, 2668, 5807, 7864, 8944, 6415, 2659, 9785, 4761, 5015, 6392, 9441, 1002, 6710]
Accuracy on these 40 examples, original questions: 35.00%
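
With only 40 questions, a single accuracy figure is a noisy estimate. As a rough illustration, the sketch below computes a Wilson score interval for a proportion. The helper name `wilson_interval` is hypothetical, and the count of 14 correct answers is inferred from the reported 35% of 40 rather than read from the run.

```python
import math
from typing import Tuple


def wilson_interval(n_correct: int, n_total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95% coverage)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# 14 correct out of 40 corresponds to the 35% reported above (inferred count).
low, high = wilson_interval(14, 40)
print(f"Approximate 95% interval: {low:.1%} to {high:.1%}")
```

For 14 of 40 correct, the interval spans roughly 22% to 50%, so a difference of a few questions between two conditions should be read with caution.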


**Perturbation 1: Typo Simulation Based on Poisson Process**

To simulate how a user might interact with the LLM in a medical Q&A setting, we first consider what is probably the most common perturbation: typos in the question. We assume typos occur along the question text following a Poisson process with rate $\lambda$ per character. The higher the value of $\lambda$, the more typos we expect in a question of a given length.

The code below automatically applies typo transformations to the questions in the examples. We then evaluate the model on these perturbed questions.


In [20]:
def introduce_typos_torch(text, typo_rate=0.1):
  """Introduces typos into a text string using a Poisson process with torch.

  Args:
    text: The input text string.
    typo_rate: The average number of typos per character (lambda).

  Returns:
    The text string with typos introduced.
  """

  text_with_typos = list(text)

  for i, char in enumerate(text):
    # Simulate typo event using torch
    num_typos = torch.poisson(torch.tensor(typo_rate)).item()

    if num_typos > 0:
      # Introduce a typo, assuming random typo for simplicity
      text_with_typos[i] = random.choice('abcdefghijklmnopqrstuvwxyz')

  return "".join(text_with_typos)
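
Drawing a Poisson count for each character and introducing a typo whenever the count is positive is equivalent to replacing each character with probability $1 - e^{-\lambda}$ (about 9.5% for $\lambda = 0.1$ and about 4.9% for $\lambda = 0.05$). The sketch below makes that explicit in plain Python. As an illustrative design choice that is not part of the function above, it only replaces ASCII letters, so digits, units, and punctuation (for example vitals such as "122/77 mmHg") are left intact. The name `introduce_letter_typos` and its `seed` parameter are hypothetical.

```python
import math
import random
import string


def introduce_letter_typos(text: str, typo_rate: float = 0.1, seed: int = 0) -> str:
    """Replace letters at random; digits, spaces, and punctuation are never touched."""
    rng = random.Random(seed)              # local RNG so the perturbation is reproducible
    p_typo = 1.0 - math.exp(-typo_rate)    # P(Poisson(typo_rate) > 0)
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch in string.ascii_letters and rng.random() < p_typo:
            # Choose a different lower-case letter so the character really changes.
            candidates = [c for c in string.ascii_lowercase if c != ch.lower()]
            chars[i] = rng.choice(candidates)
    return "".join(chars)


print(introduce_letter_typos("blood pressure is 122/77 mmHg", typo_rate=0.1, seed=217))
```

Keeping numbers fixed is one way to stay closer to the requirement that perturbations should not change the medical content of the question.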

In [21]:
sample_examples[0]

{'meta_info': 'step2&3',
 'question': 'A 26-year-old G1P0 woman is brought to the emergency room by her spouse for persistently erratic behavior. Her spouse reports that she has been sleeping > 1 hour a night, and it sometimes seems like she’s talking to herself. She has maxed out their credit cards on baby clothes. The patient’s spouse reports this has been going on for over a month. Since first seeing a physician, she has been prescribed multiple first and second generation antipsychotics, but the patient’s spouse reports that her behavior has failed to improve. Upon examination, the patient is speaking rapidly and occasionally gets up to pace the room. She reports she is doing “amazing,” and that she is “so excited for the baby to get here because I’m going to be the best mom.” She denies illicit drug use, audiovisual hallucinations, or suicidal ideation. The attending psychiatrist prescribes a class of medication the patient has not yet tried to treat the patient’s psychiatric cond

In [22]:
# applying the perturbation: typo
# typo rate = 0.1. This is a large perturbation
sample_examples_typo = [
    {
        "question": introduce_typos_torch(ex["question"]),
        "options": ex["options"],
        "answer_idx": ex["answer_idx"]
    }
    for ex in sample_examples
]

In [23]:
sample_examples_typo[0]

{'question': 'A 26-year-old G1P0 woman zs brougvt toathe emergenzy room by hkrzppouse foe persistently erraticxbehaviorb Her spouse repotts tkat she has been sleeping >p1 hour a nighl, pnd it somewimes seems like she’s tamking to herself. She has maxed out their credit cvyts ou baby clothys. Thw patient’s spouse reports this has been going on for over a month. Since firut seeingja physiciab, she gas been prescribed myltiplx firstxand second generatzoj antipsyctotics, but the pakiunt’s spouee redoits that her behmvoprqhas failed to improve. Ugon examination, the patyent is sppahing lapidlyxand occasionally getm up go paee tve room. She rgports she is doicg kamazing,” and that she isuepb excsted forathe bayy to get here becayse q’m goinn tovbe tue best moht” She denies illicit drug use,paudiowisual hzllucinatiofs, or suicidas ideationx The sttenoing psybpiatrjstuprescribex a llass of medication the patientshas not petmtried to treat thenpptient’s psrchiatric condition. In terms of this n

In [24]:
# Evaluate

acc_typo = evaluate_model_on_examples(model, sample_examples_typo)

print(f"Accuracy on these {num_samples} examples, typo in questions: {acc_typo:.2%}")

Accuracy on these 40 examples, typo in questions: 20.00%


In [25]:
# Perturbation with fewer typos
# applying the perturbation: typo with lower rate at 0.05
# About 1 mistake on average per 20 characters typed
sample_examples_typo_lrate = [
    {
        "question": introduce_typos_torch(ex["question"], typo_rate=0.05),
        "options": ex["options"],
        "answer_idx": ex["answer_idx"]
    }
    for ex in sample_examples
]

In [26]:
sample_examples_typo_lrate[0]

{'question': 'A 26-year-old G1P0 woman is boought to the emergency room bo her bpoxse foy persistentlyveroatic behavior. Her spouse revorts thaj she haskoeenkslebjing > 1 hour a night, and it sometgmes seems lwke she’s talking to herself. She has maxed hut thwirpcredit cardm xn babyaclothes. The patient’s spouse reports mhvs has been goingoon for over a month. Since firstwseeing a physician, she has been presbribed multiple first andlsecond generation antipsychocics, but tae patient’s spouse reports that her fehavifr has zailed to lmppove. Upon examination, qhe patientois speaking rapidly and occasionally gets up to pace the ronm.wShe reports she iszdoing “amazizg,” and that shetis “so exciteh for the baby to ged here because I’m going to be the best mom.g She denies illicit drug uze, audiovisual hallucenations, or suicidal ideation. Theuattendinn psychiaprilt prewcriaes a clasi ofnmedication the patient has not yet tried to treat the patient’s psychiatrdc condition. In terms of thistn

In [27]:
# Evaluate Typo with lower rate

acc_typo_lrate = evaluate_model_on_examples(model, sample_examples_typo_lrate)

print(f"Accuracy on these {num_samples} examples, typo in questions: {acc_typo_lrate:.2%}")

Accuracy on these 40 examples, typo in questions: 35.00%


**Perturbation 2: Verbosity**

Verbosity is a frequently studied topic in human-computer interaction. While we could prompt an LLM to generate a less or more verbose version of the question, in this implementation we instead modify the input question directly according to a verbosity parameter and observe how FLAN-T5 responds.

For a verbosity rate of 2, the whole question is written twice in a row.

For a verbosity rate between 1 and 2, each word is repeated with probability equal to the verbosity rate minus 1.

For a verbosity rate less than 1, each word is omitted with probability equal to the verbosity rate (at 0.5, about half of the words are dropped).

In [28]:
def adjust_verbosity(text, verbosity_rate=2):
    """Adjusts the verbosity of a text string based on the verbosity rate.

    Args:
        text: The input text string.
        verbosity_rate: The verbosity rate.

    Returns:
        The adjusted text string.
    """

    if verbosity_rate == 1:
        return text  # Return original text

    elif 1 < verbosity_rate < 2:
        words = text.split()
        modified_words = []
        for word in words:
            if random.random() < (verbosity_rate - 1):
                modified_words.extend([word, word])  # Repeat word with probability (verbosity_rate - 1)
            else:
                modified_words.append(word)
        return " ".join(modified_words)

    elif verbosity_rate == 2:
        return text + " " + text  # Identical copy of the original string

    elif verbosity_rate < 1:
        words = text.split()
        modified_words = []
        for word in words:
            if random.random() > verbosity_rate:  # Omit word with probability verbosity_rate
                modified_words.append(word)
        return " ".join(modified_words)

    else:
        return text  # Return original text for invalid verbosity_rate
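
In the branch for rates below 1, the parameter acts as a drop probability, so a rate of 0.9 would remove about 90% of the words. A possible alternative parameterization, sketched here as an assumption and not the code used for the reported results, treats the value as a keep probability so that values close to 1 stay close to the original text. The name `thin_words` and its `seed` parameter are hypothetical.

```python
import random


def thin_words(text: str, keep_prob: float, seed: int = 0) -> str:
    """Keep each whitespace-separated word independently with probability keep_prob."""
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must be between 0 and 1")
    rng = random.Random(seed)  # local RNG so the perturbation is reproducible
    kept = [word for word in text.split() if rng.random() < keep_prob]
    return " ".join(kept)


print(thin_words("She denies illicit drug use or suicidal ideation.", keep_prob=0.5, seed=217))
```

At a value of 0.5 the two parameterizations drop words at the same rate, so the difference only matters for other settings.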

In [29]:
# Experiment: verbosity rate = 2

# applying the perturbation: verbosity2
sample_examples_verbosity2 = [
    {
        "question": adjust_verbosity(ex["question"], verbosity_rate=2),
        "options": ex["options"],
        "answer_idx": ex["answer_idx"]
    }
    for ex in sample_examples
]
# Examine a sample
print(sample_examples_verbosity2[0])
print("---")
# Evaluate
acc_verbosity2 = evaluate_model_on_examples(model, sample_examples_verbosity2)
print(f"Accuracy on these {num_samples} examples, verbosity2: {acc_verbosity2:.2%}")

{'question': 'A 26-year-old G1P0 woman is brought to the emergency room by her spouse for persistently erratic behavior. Her spouse reports that she has been sleeping > 1 hour a night, and it sometimes seems like she’s talking to herself. She has maxed out their credit cards on baby clothes. The patient’s spouse reports this has been going on for over a month. Since first seeing a physician, she has been prescribed multiple first and second generation antipsychotics, but the patient’s spouse reports that her behavior has failed to improve. Upon examination, the patient is speaking rapidly and occasionally gets up to pace the room. She reports she is doing “amazing,” and that she is “so excited for the baby to get here because I’m going to be the best mom.” She denies illicit drug use, audiovisual hallucinations, or suicidal ideation. The attending psychiatrist prescribes a class of medication the patient has not yet tried to treat the patient’s psychiatric condition. In terms of this n

In [30]:
# Experiment: verbosity rate = 1.5

# applying the perturbation: verbosity1.5
sample_examples_verbosity15 = [
    {
        "question": adjust_verbosity(ex["question"], verbosity_rate=1.5),
        "options": ex["options"],
        "answer_idx": ex["answer_idx"]
    }
    for ex in sample_examples
]
# Examine a sample
print(sample_examples_verbosity15[0])
print("---")
# Evaluate
acc_verbosity15 = evaluate_model_on_examples(model, sample_examples_verbosity15)
print(f"Accuracy on these {num_samples} examples, verbosity rate = 1.5: {acc_verbosity15:.2%}")

{'question': 'A 26-year-old 26-year-old G1P0 G1P0 woman is is brought to the emergency room room by by her her spouse for for persistently persistently erratic behavior. behavior. Her spouse reports reports that that she she has been sleeping > 1 1 hour a night, and and it it sometimes sometimes seems seems like like she’s she’s talking talking to to herself. She has has maxed out their their credit credit cards cards on baby baby clothes. The patient’s patient’s spouse reports this this has has been going going on for over over a a month. month. Since first seeing a a physician, she has been prescribed multiple multiple first and and second second generation antipsychotics, antipsychotics, but the patient’s spouse reports that that her behavior behavior has failed failed to to improve. Upon Upon examination, examination, the patient is speaking speaking rapidly rapidly and occasionally occasionally gets up up to pace the the room. room. She reports reports she she is doing “amazing,” 

In [31]:
# Experiment: verbosity rate = 0.5

# applying the perturbation: verbosity0.5
sample_examples_verbosity05 = [
    {
        "question": adjust_verbosity(ex["question"], verbosity_rate=0.5),
        "options": ex["options"],
        "answer_idx": ex["answer_idx"]
    }
    for ex in sample_examples
]
# Examine a sample
print(sample_examples_verbosity05[0])
print("---")
# Evaluate
acc_verbosity05 = evaluate_model_on_examples(model, sample_examples_verbosity05)
print(f"Accuracy on these {num_samples} examples, verbosity rate = 0.5: {acc_verbosity05:.2%}")

{'question': 'A woman is to the emergency room her erratic Her spouse that she been hour a night, seems like to She maxed out cards on baby spouse reports has been a month. prescribed and second but the spouse has Upon patient is speaking and up reports she is doing “amazing,” and that she is for baby to get here because going mom.” She illicit use, hallucinations, or ideation. psychiatrist a class of the patient not yet tried treat psychiatric In of this medication, the is likely at increased', 'options': [{'key': 'A', 'value': 'Attention deficit hyperactivity disorder'}, {'key': 'B', 'value': 'Caudal regression syndrome'}, {'key': 'C', 'value': 'Ototoxicity'}, {'key': 'D', 'value': 'Renal defects'}, {'key': 'E', 'value': 'Right ventricular atrialization'}], 'answer_idx': 'E'}
---
Accuracy on these 40 examples, verbosity rate = 0.5: 27.50%


**Perturbation 3: Paraphrasing**

In [32]:
instruction = """
Paraphrase the question so that it is easier to understand.
Make sure that all details are preserved.
The output should be in similar length to the input.
The question is provided below
"""
instruction

'\nParaphrase the question so that it is easier to understand. \nMake sure that all details are preserved. \nThe output should be in similar length to the input.\nThe question is provided below\n'

In [33]:
def modify_text_with_instruction(model, text: str, instruction: str) -> str:
  """
  Modifies the input text based on the provided instruction using the FLAN T5 model.

  Args:
    model: The FLAN T5 model.
    text: The input text string.
    instruction: The instruction string for modifying the text.

  Returns:
    The modified text string generated by the model.
  """

  # Create prompt by combining instruction and text
  prompt = f"{instruction}: {text}"

  # Encode the prompt
  enc = tokenizer(prompt, return_tensors="pt")

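  # Generate output using the model.
  # Note: no max_new_tokens is set here, so generate() uses the default length limit
  # (20 tokens in transformers unless the generation config overrides it), which caps the paraphrase length.
  out_ids = model.generate(**enc)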

  # Decode the output to get the modified text
  modified_text = tokenizer.decode(out_ids[0], skip_special_tokens=True).strip()

  return modified_text
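
Since the instruction asks the paraphraser to preserve all details, it can help to check each paraphrase automatically before evaluating on it. The sketch below is a simple heuristic under stated assumptions, not a validated metric: `paraphrase_report` is a hypothetical helper that compares word counts and checks how many of the numbers in the original (ages, vitals, durations) reappear in the paraphrase.

```python
import re
from typing import Dict

NUMBER_PATTERN = r"\d+(?:\.\d+)?"


def paraphrase_report(original: str, paraphrase: str) -> Dict[str, float]:
    """Return the word-count ratio and the fraction of original numbers that survive."""
    orig_numbers = set(re.findall(NUMBER_PATTERN, original))
    para_numbers = set(re.findall(NUMBER_PATTERN, paraphrase))
    n_orig_words = len(original.split())
    return {
        "length_ratio": len(paraphrase.split()) / n_orig_words if n_orig_words else 0.0,
        "number_retention": (
            len(orig_numbers & para_numbers) / len(orig_numbers) if orig_numbers else 1.0
        ),
    }


# Toy strings for illustration only.
original = "A 23-year-old woman at 22 weeks gestation has a pulse of 80/min."
paraphrase = "What is the best treatment?"
print(paraphrase_report(original, paraphrase))
```

A paraphrase with a low length ratio or low number retention has probably dropped clinical details and could be flagged or regenerated before it is used as a perturbed question.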

In [34]:
# applying the perturbation: paraphrasing
sample_examples_paraphrased = [
    {
        "question": modify_text_with_instruction(model, ex["question"], instruction),
        "options": ex["options"],
        "answer_idx": ex["answer_idx"]
    }
    for ex in sample_examples
]
# Examine a sample
print(sample_examples_paraphrased[0])

{'question': 'What is the most likely side effect of the new medication?', 'options': [{'key': 'A', 'value': 'Attention deficit hyperactivity disorder'}, {'key': 'B', 'value': 'Caudal regression syndrome'}, {'key': 'C', 'value': 'Ototoxicity'}, {'key': 'D', 'value': 'Renal defects'}, {'key': 'E', 'value': 'Right ventricular atrialization'}], 'answer_idx': 'E'}


In [35]:
# Evaluate paraphrasing
acc_paraphrased = evaluate_model_on_examples(model, sample_examples_paraphrased)
print(f"Accuracy on these {num_samples} examples, paraphrased: {acc_paraphrased:.2%}")

Accuracy on these 40 examples, paraphrased: 25.00%


In [36]:
# somewhat surprising - maybe it is just randomly guessing.

In [37]:
# debugging
example = sample_examples_paraphrased[0]
question = example['question']
options = example['options']
answer = example['answer_idx']

# Given an example from the dataset, generate an answer using the model and then evaluate its correctness
generated_answer = get_model_answer(model, question, options)
correct_bool = evaluate_model(generated_answer['key'], answer)

print("Original question:", question)
print(f"\nModel's answer: {generated_answer['key']} ({generated_answer['value']})")
print("\nCorrect answer:", answer)
print("\nIs the model correct?", correct_bool)

Original question: What is the most likely side effect of the new medication?

Model's answer: E (Right ventricular atrialization)

Correct answer: E

Is the model correct? True


### Written analysis (20 points)

Answer the following questions about this assignment. Keep your responses concise, answering each question in just a few sentences.

1. **Robustness analysis (10 pts):** What types of perturbations did you implement? Which ones caused the model to fail, and in what ways? Were there any cases where the model remained robust?
2. **Improving model robustness (5 pts):** What strategies could improve the model's ability to handle variations in question phrasing? What recommendations would you give to developers building these models?
3. **Advice to users (5 pts):** How should doctors and patients approach using LLMs for medical advice, given their strengths and weaknesses?


#### Written Analysis

##### **Robustness Analysis**

FLAN-T5 did not achieve high accuracy on MedQA. On our sample of 40 questions, the accuracy is 35%. Compared with larger models (especially commercial models such as the ChatGPT or Gemini models), this accuracy is not high, although it is above the random-guessing baseline (20%, since there are five options to choose from).

I implemented three types of perturbations: (1) introducing typos in the question; (2) making the question more verbose by repeating words, or less verbose by randomly dropping words; and (3) using FLAN-T5 itself to paraphrase the question.

Perturbation (3), using FLAN-T5 to paraphrase the question, decreased the accuracy to 25%. This is mainly because FLAN-T5 did not paraphrase well or follow the prompt instruction. Even though the prompt explicitly asked the model to retain all details and produce output of similar length, the generated output is very succinct and contains almost none of the details in the original question. A decrease in performance is therefore unsurprising.

Perturbation (2), changing verbosity, did not significantly impact model performance when the questions were made more verbose. Randomly dropping words decreased the accuracy to 27.5%, which is expected.

Perturbation (1), introducing typos, is a bit more interesting. I used a Poisson process to simulate how clinicians may make typos in real settings. For rate = 0.1 (an extreme case, about one typo per 10 characters on average), the accuracy dropped to 20%, which is no better than random guessing. However, *the model remained robust* for rate = 0.05 (about one typo per 20 characters on average): the accuracy stayed at 35%, the same as on the original questions. This is possibly because the FLAN-T5 tokenizer can absorb minor typos.

##### **Improving Model Robustness**

To build models suitable for MedQA, I think the best approach would still be *instruction fine-tuning*. Data augmentation (possibly including examples with simulated typos in the questions) and targeted fine-tuning would update the model weights to make it more specialized in the knowledge MedQA requires, which should improve performance on the test set. Furthermore, we can borrow the idea of contrastive learning to improve model robustness, rewarding the model for choosing the same answer across different variations of the question phrasing (typos, different levels of detail, etc.).
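
The idea of rewarding the same answer across phrasings can also be measured before any training. The sketch below is one possible consistency metric, under the assumption that predictions are stored per condition and aligned by question. `consistency_rate` is a hypothetical helper, and the predictions shown are toy values, not results from this notebook.

```python
from typing import Dict, List, Optional


def consistency_rate(predictions: Dict[str, List[Optional[str]]]) -> float:
    """Fraction of questions that receive the same non-missing answer under every condition."""
    conditions = list(predictions.values())
    if not conditions or not conditions[0]:
        raise ValueError("need at least one condition with at least one prediction")
    n_questions = len(conditions[0])
    if any(len(preds) != n_questions for preds in conditions):
        raise ValueError("all conditions must cover the same questions")

    consistent = 0
    for answers in zip(*conditions):
        # A missing prediction (None) never counts as consistent.
        if answers[0] is not None and all(a == answers[0] for a in answers):
            consistent += 1
    return consistent / n_questions


# Toy predictions for four questions under two conditions (illustration only).
toy_predictions = {
    "original": ["A", "B", "C", None],
    "typo": ["A", "D", "C", None],
}
print(f"Consistency: {consistency_rate(toy_predictions):.0%}")
```

Unlike accuracy, this metric does not need ground-truth labels, so it could be tracked on unlabeled user queries as well.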

##### **Advice to users**

LLMs can be helpful, but as this simple example shows, the accuracy is not high, and even commercial-grade LLMs are not perfect. As such, clinicians should proceed with caution when interacting with LLMs and include human checks when the stakes are high (e.g. real clinical decision making, treatment planning, etc.). Clinicians should also treat LLM outputs as only one input to the human clinical decision process, and always check decisions against current medical guidelines.

As for patients, while LLMs can help with patient education, patients should view LLM outputs as supplementary education materials rather than medical directives.






## Submission Guidelines

Upload this colab notebook and your writeup to the Canvas assignment.

When you have completed this colab notebook, please export to PDF by selecting File > Print > Save as PDF. **Please confirm that the outputs of all code cells are visible in the PDF.** Then, upload your PDF on Canvas.