# Commonsense MCQA

Multiple-choice question answering (MCQA) is a common format for evaluating a model's reasoning ability. This notebook benchmarks few-shot prompting against a LoRA adapter (trained with DPO) on the [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa) dataset. We sweep over the number of (positive) few-shot examples and study how accuracy scales relative to the fine-tuned adapter and the unsteered baseline across two models.

### Runtime Estimate

> **Estimated Time:** 2-3 hours (fine-tuning two models on ~39k preference pairs)  
> **Device:** NVIDIA H100 GPU (80GB VRAM)

Times are approximate and vary based on dataset size, number of sweeps, and model configuration. Adjust parameters in the cells below to modify runtime.

## Setup

In [1]:
import json
from pathlib import Path

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import pandas as pd
import transformers
from datasets import Dataset, load_dataset
from peft import PeftType

from aisteer360.algorithms.input_control.few_shot.control import FewShot
from aisteer360.algorithms.core.specs import ControlSpec
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO
from aisteer360.evaluation.use_cases.commonsense_mcqa.use_case import CommonsenseMCQA
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_accuracy import MCQAAccuracy
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_positional_bias import MCQAPositionalBias
from aisteer360.evaluation.benchmark import Benchmark
from aisteer360.evaluation.utils.data_utils import flatten_profiles, get_param_values, summarize_by_config
from aisteer360.evaluation.utils.viz_utils import plot_sensitivity, plot_tradeoff

transformers.logging.set_verbosity_error()

MODELS = [
    "Qwen/Qwen2.5-0.5B-Instruct",
    "Qwen/Qwen2.5-1.5B-Instruct",
]

NOTEBOOK_DIR = Path(__file__).parent if "__file__" in dir() else Path.cwd() / "examples/notebooks/benchmark_commonsense_mcqa"
FIGURE_DIR = NOTEBOOK_DIR / "figures"
FIGURE_DIR.mkdir(parents=True, exist_ok=True)  # parents=True: the fallback NOTEBOOK_DIR may not exist yet

LETTERS = "ABCDE"



## Loading the data

We load the [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa) dataset from Hugging Face. The `validation` split is used for evaluation and the `train` split is used for steering (few-shot example pools and DPO training data).

In [2]:
csqa = load_dataset("tau/commonsense_qa")

eval_split = csqa["validation"]
steer_split = csqa["train"]

print(f"Evaluation split: {len(eval_split)} questions")
print(f"Steering split: {len(steer_split)} questions")

Evaluation split: 1221 questions
Steering split: 9741 questions


The `CommonsenseMCQA` use case expects each evaluation record to contain the question text, the correct answer text, 
and the full list of choices (so that it can shuffle them across runs to measure positional bias). We build these records 
directly from the `validation` split.

In [3]:
eval_records = []
for row in eval_split:
    correct_idx = LETTERS.index(row["answerKey"])
    choices = row["choices"]["text"]
    eval_records.append({
        "id": row["id"],
        "question": row["question"],
        "answer": choices[correct_idx],
        "choices": choices,
    })

len(eval_records), eval_records[0]

(1221,
 {'id': '1afa02df02c908a558b4036e80242fac',
  'question': 'A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?',
  'answer': 'bank',
  'choices': ['bank', 'library', 'department store', 'mall', 'new york']})

## Building the use case

The use case of interest has already been constructed via the [use case](../../../docs/tutorials/add_new_use_case.md) 
tutorial and is available at `aisteer360/evaluation/use_cases/commonsense_mcqa/use_case.py`. We pass in `eval_records`
as the evaluation data for the use case. 

In [4]:
commonsense_mcqa = CommonsenseMCQA(
    evaluation_data=eval_records,
    evaluation_metrics=[MCQAAccuracy(), MCQAPositionalBias()],
    num_samples=50,
    num_shuffling_runs=20
)

Two custom metrics have been created for the use case: 
- `MCQAAccuracy`: measures the accuracy statistics of each question (across trials)
- `MCQAPositionalBias`: measures the positional bias (via deviation from the uniform distribution across runs)

To facilitate computation of these statistics, the use case accepts a keyword argument `num_shuffling_runs` dictating 
how many times each question should be presented to the (steered) model under a randomized ordering of the choices. 
We restrict the number of evaluation datapoints to `num_samples=50` for speed.
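As a rough illustration of why repeated shuffling is useful, the sketch below presents one question under several random orderings, records which position a model picks, and scores how far the histogram of picked positions deviates from uniform. The `always_first` model, the helper names, and the mean-absolute-deviation score are assumptions made for illustration; `MCQAPositionalBias` may compute its statistic differently.

```python
import random
from collections import Counter


def shuffled_runs(choices, num_runs, seed=0):
    """Yield independent random orderings of the answer choices."""
    rng = random.Random(seed)
    for _ in range(num_runs):
        order = list(choices)
        rng.shuffle(order)
        yield order


def positional_bias(picked_positions, num_positions):
    """Mean absolute deviation of pick frequencies from the uniform distribution."""
    counts = Counter(picked_positions)
    total = len(picked_positions)
    uniform = 1.0 / num_positions
    deviations = [abs(counts.get(p, 0) / total - uniform) for p in range(num_positions)]
    return sum(deviations) / num_positions


def always_first(ordering):
    """Hypothetical, maximally biased model: always answers with position 0."""
    return 0


choices = ["bank", "library", "department store", "mall", "new york"]
runs = list(shuffled_runs(choices, num_runs=20))
picks = [always_first(order) for order in runs]

bias = positional_bias(picks, num_positions=len(choices))
# The biased model is only correct when the shuffle happens to put "bank" first.
accuracy = sum(order[pick] == "bank" for order, pick in zip(runs, picks)) / len(runs)
print(f"bias={bias:.2f}, accuracy={accuracy:.2f}")
```

A model that always picks the same position is correct only when the shuffle places the right answer there, so shuffling separates knowledge of the answer from a preference for a position.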

## Preparing the steering data

Both steering methods draw from the `train` (steer) split and share a common MCQA prompt format: a question with lettered choices, expecting a single letter response. We define this format once and reuse it for both the few-shot example pools and the DPO preference pairs.

In [5]:
def format_mcqa_prompt(question: str, choices: list[str]) -> str:
    lines = ["You will be given a multiple-choice question and asked to select from a set of choices."]
    lines.append(f"\nQuestion: {question}\n")
    for i, choice in enumerate(choices):
        lines.append(f"{LETTERS[i]}. {choice}")
    lines.append("\nPlease only print the letter corresponding to your choice.")
    lines.append("\nAnswer:")
    return "\n".join(lines)
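To make the shared format concrete, the snippet below applies the same function to the first evaluation record shown earlier and prints the rendered prompt. The function is restated only so that the snippet runs on its own.

```python
LETTERS = "ABCDE"


def format_mcqa_prompt(question: str, choices: list[str]) -> str:
    # Same logic as the notebook cell above, repeated for self-containment.
    lines = ["You will be given a multiple-choice question and asked to select from a set of choices."]
    lines.append(f"\nQuestion: {question}\n")
    for i, choice in enumerate(choices):
        lines.append(f"{LETTERS[i]}. {choice}")
    lines.append("\nPlease only print the letter corresponding to your choice.")
    lines.append("\nAnswer:")
    return "\n".join(lines)


prompt = format_mcqa_prompt(
    "A revolving door is convenient for two direction travel, "
    "but it also serves as a security measure at a what?",
    ["bank", "library", "department store", "mall", "new york"],
)
print(prompt)
```

The prompt ends with `Answer:`, so a single letter is the natural continuation, which is what both the few-shot answers and the DPO `chosen`/`rejected` fields contain.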

### Few-shot example pools

For FewShot, we build positive and negative example pools from the training split. Each example is a formatted MCQA prompt paired with a letter answer, matching the format the model will see at evaluation time.

In [6]:
positive_pool = []
negative_pool = []

for row in steer_split:
    choices = row["choices"]["text"]
    correct_idx = LETTERS.index(row["answerKey"])
    prompt = format_mcqa_prompt(row["question"], choices)

    wrong_indices = [i for i in range(len(choices)) if i != correct_idx]
    positive_pool.append({"prompt": prompt, "answer": LETTERS[correct_idx]})
    negative_pool.append({"prompt": prompt, "answer": LETTERS[wrong_indices[0]]})

print(f"Few-shot pools: {len(positive_pool)} positive, {len(negative_pool)} negative")
print(f"\nExample positive:")
print(f"  Prompt: {positive_pool[0]['prompt'][:200]}...")
print(f"  Answer: {positive_pool[0]['answer']}")

Few-shot pools: 9741 positive, 9741 negative

Example positive:
  Prompt: You will be given a multiple-choice question and asked to select from a set of choices.

Question: The sanctions against the school were a punishing blow, and they seemed to what the efforts the schoo...
  Answer: A


### DPO preference pairs

For DPO, we create preference pairs using the same prompt format. Each pair contrasts the correct letter against an incorrect one. To increase training diversity, we create up to four pairs per question by contrasting the correct answer against each wrong answer.

In [7]:
dpo_pairs = []
for row in steer_split:
    choices = row["choices"]["text"]
    correct_idx = LETTERS.index(row["answerKey"])
    prompt = format_mcqa_prompt(row["question"], choices)

    wrong_indices = [i for i in range(len(choices)) if i != correct_idx]
    for wrong_idx in wrong_indices[:4]:
        dpo_pairs.append({
            "prompt": prompt,
            "chosen": LETTERS[correct_idx],
            "rejected": LETTERS[wrong_idx],
        })

train_ds = Dataset.from_list(dpo_pairs)

print(f"Created {len(train_ds)} DPO preference pairs from {len(steer_split)} questions")
print(f"\nExample pair:")
print(f"  Prompt: {train_ds[0]['prompt'][:200]}...")
print(f"  Chosen: {train_ds[0]['chosen']}")
print(f"  Rejected: {train_ds[0]['rejected']}")

Created 38964 DPO preference pairs from 9741 questions

Example pair:
  Prompt: You will be given a multiple-choice question and asked to select from a set of choices.

Question: The sanctions against the school were a punishing blow, and they seemed to what the efforts the schoo...
  Chosen: A
  Rejected: B
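Each preference pair contributes one term to the DPO objective. The sketch below computes the standard sigmoid DPO loss for a single pair from sequence log-probabilities under the policy and the frozen reference model. The log-probability values and the `beta` value used in the example are made up for illustration, and the trainer used later may apply further options (for example label smoothing) that are not modeled here.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta):
    """Return (loss, chosen_reward, rejected_reward) for one preference pair."""
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward


# Policy identical to the reference: zero margin, so the loss is log(2).
loss_start, _, _ = dpo_sigmoid_loss(-30.0, -45.0, -30.0, -45.0, beta=0.1)

# Policy has shifted probability toward the chosen letter (illustrative numbers).
loss_later, r_chosen, r_rejected = dpo_sigmoid_loss(-25.0, -60.0, -30.0, -45.0, beta=0.1)

print(f"start loss: {loss_start:.4f}")
print(f"later loss: {loss_later:.4f}, margin: {r_chosen - r_rejected:.2f}")
```

The loss falls as the reward margin between the chosen and rejected letters grows, which is the quantity that contrasting the correct letter against each wrong one is meant to push up.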


## Defining the controls

### FewShot with ControlSpec

One goal of this notebook is to explore how the number of in-context examples affects model behavior. We use the toolkit's `ControlSpec` class to sweep over different values of `k_positive`, and fix `k_negative=0` to isolate the effect of positive examples.

In [8]:
few_shot_spec = ControlSpec(
    control_cls=FewShot,
    params={
        "selector_name": "random",
        "positive_example_pool": positive_pool,
        "negative_example_pool": negative_pool,
        "k_negative": 0,
    },
    vars=[{"k_positive": k} for k in [1, 5, 10, 25, 50]],
    name="FewShot",
)
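Conceptually, the spec above describes five configurations: one per entry of `vars`, each sharing the fixed `params`. The sketch below mimics that expansion with plain dictionaries and then draws `k_positive` random examples for each configuration. `expand_configs` and `sample_examples` are hypothetical helpers, not toolkit functions, the pool is synthetic, and the toolkit's `random` selector may behave differently.

```python
import random


def expand_configs(params, sweep_vars):
    """Merge the fixed params with each swept assignment (hypothetical helper)."""
    return [{**params, **assignment} for assignment in sweep_vars]


def sample_examples(pool, k, rng):
    """Draw up to k distinct examples uniformly at random (hypothetical helper)."""
    return rng.sample(pool, min(k, len(pool)))


# Synthetic stand-in for the positive example pool.
pool = [{"prompt": f"Question {i}", "answer": "ABCDE"[i % 5]} for i in range(100)]

params = {"selector_name": "random", "k_negative": 0}
sweep_vars = [{"k_positive": k} for k in [1, 5, 10, 25, 50]]
configs = expand_configs(params, sweep_vars)

rng = random.Random(0)
for config in configs:
    shots = sample_examples(pool, config["k_positive"], rng)
    print(f"k_positive={config['k_positive']:>2}: {len(shots)} examples selected")
```

Keeping the pools in the shared `params` and only `k_positive` in the sweep means every configuration draws from the same data, so differences between configurations reflect the number of examples rather than their source.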

### DPO with LoRA

The DPO-LoRA control fine-tunes a LoRA adapter using the preference pairs we created above (`dpo_pairs`, wrapped as the dataset `train_ds`). The two models have slightly different training requirements, so we define a convenience function that builds the control from a per-model config.

In [9]:
DPO_CONFIGS = {
    "Qwen/Qwen2.5-0.5B-Instruct": {
        "learning_rate": 5e-5,
        "num_train_epochs": 5,
    },
    "Qwen/Qwen2.5-1.5B-Instruct": {
        "learning_rate": 2e-5,
        "num_train_epochs": 3,
    }
}


def create_dpo_control(model_name: str) -> DPO:
    """Create a DPO control with model-specific hyperparameters."""
    short_name = model_name.split("/")[-1]
    config = DPO_CONFIGS.get(model_name, DPO_CONFIGS["Qwen/Qwen2.5-0.5B-Instruct"])

    return DPO(
        train_dataset=train_ds,

        # DPO / TRL config
        output_dir=NOTEBOOK_DIR / f"trl_models/{short_name}-DPO-Lora-Steer",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        num_train_epochs=config["num_train_epochs"],
        learning_rate=config["learning_rate"],
        beta=0.1,
        loss_type="sigmoid",
        max_length=512,
        max_prompt_length=450,
        disable_dropout=True,
        logging_steps=200,
        save_strategy="no",
        report_to="none",
        seed=123,

        # LoRA config
        use_peft=True,
        peft_type=PeftType.LORA,
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        adapter_name="dpo",
        merge_lora_after_train=False,
    )
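For a sense of the training budget these settings imply, the sketch below works out the effective batch size and the approximate number of optimizer steps per model. It assumes a single device and that a partial final batch or accumulation group still counts as a step; the trainer's exact rounding is an assumption here, not something the notebook states.

```python
import math

NUM_PAIRS = 38964          # size of train_ds reported above
PER_DEVICE_BATCH = 8       # per_device_train_batch_size
GRAD_ACCUM = 2             # gradient_accumulation_steps
LOGGING_STEPS = 200        # logging_steps
EPOCHS = {"Qwen2.5-0.5B-Instruct": 5, "Qwen2.5-1.5B-Instruct": 3}

# Assumption: one device, so the effective batch is batch size x accumulation.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
batches_per_epoch = math.ceil(NUM_PAIRS / PER_DEVICE_BATCH)
steps_per_epoch = math.ceil(batches_per_epoch / GRAD_ACCUM)

total_steps = {}
for name, epochs in EPOCHS.items():
    total_steps[name] = steps_per_epoch * epochs
    log_lines = total_steps[name] // LOGGING_STEPS
    print(f"{name}: {total_steps[name]} optimizer steps, about {log_lines} log lines")

print(f"effective batch size: {effective_batch}, steps per epoch: {steps_per_epoch}")
```

Under these assumptions the smaller model takes noticeably more optimizer steps than the larger one, since it trains for five epochs rather than three.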

## Running the benchmark

The benchmark compares three pipelines (an unsteered baseline and two steering approaches) across both model sizes:
- **baseline**: Unsteered model
- **few_shot_sweep**: FewShot with varying `k_positive` (1, 5, 10, 25, 50)
- **dpo_lora**: DPO-trained LoRA adapter

We run with `num_trials=5` to capture statistical variability across generation runs (at the cost of slower execution).

In [None]:
all_profiles = {}

for model_name in MODELS:
    short_name = model_name.split("/")[-1]
    print(f"Running benchmark for {short_name}")

    dpo_lora = create_dpo_control(model_name)

    benchmark = Benchmark(
        use_case=commonsense_mcqa,
        base_model_name_or_path=model_name,
        steering_pipelines={
            "baseline": [],
            "few_shot_sweep": [few_shot_spec],
            "dpo_lora": [dpo_lora],
        },
        gen_kwargs={"max_new_tokens": 300, "do_sample": True, "temperature": 0.7},
        device_map="auto",
        num_trials=5
    )

    profiles = benchmark.run()
    all_profiles[short_name] = profiles

    benchmark.export(profiles, save_dir=f"./profiles/{short_name}/")

Running benchmark for Qwen2.5-0.5B-Instruct
Running pipeline: baseline...
done.
Running pipeline: few_shot_sweep...
Running configuration 1...
Running configuration 2...
Running configuration 3...
Running configuration 4...
Running configuration 5...
done.
Running pipeline: dpo_lora...


Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38964/38964 [00:00<00:00, 52982.88 examples/s]
Extracting prompt in train dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38964/38964 [00:01<00:00, 33927.36 examples/s]
Applying chat template to train dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38964/38964 [00:00<00:00, 40298.68 examples/s]
Tokenizing train dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38964/38964 [00:08<00:00, 4588.73 examples/s]
Train dataset reference log probs: 100%|████████████████████████████████████████████████████████████████████████████████████

{'loss': 0.4405, 'grad_norm': 3.811967372894287, 'learning_rate': 4.918308702791461e-05, 'rewards/chosen': -0.37143298983573914, 'rewards/rejected': -1.545861840248108, 'rewards/accuracies': 0.7996875047683716, 'rewards/margins': 1.1744288206100464, 'logps/chosen': -35.37575912475586, 'logps/rejected': -48.15895462036133, 'logits/chosen': -2.738180637359619, 'logits/rejected': -2.7275230884552, 'epoch': 0.0821186614658181}
{'loss': 0.3837, 'grad_norm': 7.952126979827881, 'learning_rate': 4.8362068965517246e-05, 'rewards/chosen': 0.06859302520751953, 'rewards/rejected': -1.583513617515564, 'rewards/accuracies': 0.8346874713897705, 'rewards/margins': 1.652106761932373, 'logps/chosen': -30.929655075073242, 'logps/rejected': -48.548828125, 'logits/chosen': -2.9601025581359863, 'logits/rejected': -2.9752004146575928, 'epoch': 0.1642373229316362}
{'loss': 0.348, 'grad_norm': 7.980932235717773, 'learning_rate': 4.754105090311987e-05, 'rewards/chosen': 0.14456726610660553, 'rewards/rejected': 

Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38964/38964 [00:00<00:00, 52268.51 examples/s]
Extracting prompt in train dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38964/38964 [00:01<00:00, 34327.79 examples/s]
Applying chat template to train dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38964/38964 [00:00<00:00, 40113.29 examples/s]
Tokenizing train dataset: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38964/38964 [00:08<00:00, 4586.08 examples/s]
Train dataset reference log probs: 100%|████████████████████████████████████████████████████████████████████████████████████

{'loss': 0.3183, 'grad_norm': 2.4701852798461914, 'learning_rate': 1.945539135194308e-05, 'rewards/chosen': -0.8776431083679199, 'rewards/rejected': -2.9141292572021484, 'rewards/accuracies': 0.8884375095367432, 'rewards/margins': 2.0364861488342285, 'logps/chosen': -46.047210693359375, 'logps/rejected': -70.37177276611328, 'logits/chosen': -2.2215988636016846, 'logits/rejected': -2.2087676525115967, 'epoch': 0.0821186614658181}
{'loss': 0.2653, 'grad_norm': 7.1405205726623535, 'learning_rate': 1.8908045977011497e-05, 'rewards/chosen': -0.7136953473091125, 'rewards/rejected': -3.4527807235717773, 'rewards/accuracies': 0.8934375047683716, 'rewards/margins': 2.7390854358673096, 'logps/chosen': -44.42201232910156, 'logps/rejected': -75.73184204101562, 'logits/chosen': -2.2573461532592773, 'logits/rejected': -2.3170762062072754, 'epoch': 0.1642373229316362}
{'loss': 0.2384, 'grad_norm': 5.1630778312683105, 'learning_rate': 1.8360700602079915e-05, 'rewards/chosen': -0.4649747610092163, 'rew

## Analysis

We now analyze the benchmark results across both models.

First, we flatten the nested profiles into a single DataFrame with one row per trial (using the toolkit's utility `flatten_profiles`), then aggregate across trials to get mean and standard deviation.

In [None]:
dfs = []
for model_name, profiles in all_profiles.items():
    df = flatten_profiles(
        profiles,
        metric_accessors={
            "accuracy": ("MCQAAccuracy", "question_mean"),
            "positional_bias": ("MCQAPositionalBias", "mean"),
        }
    )
    df["model"] = model_name
    df["k_positive"] = get_param_values(df, "FewShot", "k_positive")
    dfs.append(df)

runs_df = pd.concat(dfs, ignore_index=True)
runs_df[["model", "pipeline", "trial_id", "k_positive", "accuracy", "positional_bias"]]

Next, we summarize by configuration (aggregating across trials), then attach the `k_positive` value to each of the few-shot rows.

In [None]:
summary_df = summarize_by_config(
    runs_df,
    metric_cols=["accuracy", "positional_bias"],
    group_cols=["model", "pipeline", "config_id"]
)

k_map = runs_df.groupby(["model", "pipeline", "config_id"])["k_positive"].first()
summary_df["k_positive"] = summary_df.apply(
    lambda row: k_map.get((row["model"], row["pipeline"], row["config_id"]), np.nan), axis=1
)

summary_df[["model", "pipeline", "k_positive", "n_trials", "accuracy_mean", "accuracy_std", "positional_bias_mean", "positional_bias_std"]].round(3)
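The aggregation step amounts to grouping per-trial rows by configuration and reducing each metric to a count, mean, and standard deviation. The sketch below shows that idea with the standard library only. The rows are made-up values rather than results from this notebook, `summarize` is a hypothetical stand-in for `summarize_by_config`, and the choice of the sample standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up per-trial rows for illustration only (not results from this notebook).
trial_rows = [
    {"pipeline": "pipeline_a", "config_id": 0, "accuracy": 0.40},
    {"pipeline": "pipeline_a", "config_id": 0, "accuracy": 0.50},
    {"pipeline": "pipeline_a", "config_id": 0, "accuracy": 0.60},
    {"pipeline": "pipeline_b", "config_id": 0, "accuracy": 0.70},
    {"pipeline": "pipeline_b", "config_id": 0, "accuracy": 0.80},
    {"pipeline": "pipeline_b", "config_id": 0, "accuracy": 0.90},
]


def summarize(rows, metric, group_keys):
    """Group rows by the given keys and reduce one metric across trials."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in group_keys)].append(row[metric])
    return {
        key: {
            "n_trials": len(values),
            "mean": mean(values),
            # Sample standard deviation; undefined for a single trial, so use 0.0.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
        for key, values in groups.items()
    }


summary = summarize(trial_rows, "accuracy", ["pipeline", "config_id"])
for key, stats in summary.items():
    print(key, stats)
```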

### DPO vs. FewShot

We now examine how the DPO-LoRA control performs in comparison to FewShot, particularly as we scale the number of (positive) examples. Both the DPO-LoRA control and baseline (unsteered) pipelines are shown as horizontal reference lines passed in using the `compare_to_pipelines` argument.

In [None]:
few_shot_df = summary_df[summary_df["pipeline"] == "few_shot_sweep"].copy()
few_shot_df = few_shot_df.sort_values(["model", "k_positive"])

# common axis limits
all_accuracy = runs_df["accuracy"].dropna()
ylim_accuracy = (max(0, all_accuracy.min() - 0.1), min(1, all_accuracy.max() + 0.1))

n_models = len(MODELS)
fig = plt.figure(figsize=(5 * n_models, 4))
gs = gridspec.GridSpec(1, n_models, wspace=0.3)

for idx, model_name in enumerate(MODELS):
    short_name = model_name.split("/")[-1]
    ax = fig.add_subplot(gs[0, idx])

    # extract data under each pipeline
    model_swept = few_shot_df[few_shot_df["model"] == short_name].copy()
    model_baseline = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "baseline")]
    model_dpo = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "dpo_lora")]

    # individual trial data (for scatter overlay)
    model_trials = runs_df[(runs_df["model"] == short_name) & (runs_df["pipeline"] == "few_shot_sweep")]
    
    plot_sensitivity(
        swept=model_swept,
        metric="accuracy",
        sweep_col="k_positive",
        per_trial_data=model_trials,
        compare_to_pipelines=[
            ("baseline", model_baseline),
            ("DPO-LoRA", model_dpo),
        ],
        ax=ax,
        metric_label="accuracy",
        sweep_label="k_positive",
        title=short_name,
        ylim=ylim_accuracy,
    )

fig.savefig(FIGURE_DIR / "sensitivity_accuracy.png", bbox_inches="tight", dpi=150)
plt.show()

Fine-tuning (under DPO-LoRA) produces a large jump in accuracy for the 0.5B model and a smaller but still noticeable jump for the 1.5B model. The FewShot control provides a small accuracy gain, but only for a moderate number of examples; the gains diminish as the number of examples increases.

### Accuracy vs. positional bias tradeoff

We examine whether there is a tradeoff between accuracy and positional bias across methods. The FewShot configurations are colored by `k_positive`, with the baseline shown as a black X marker and DPO-LoRA as a red square. The Pareto frontier indicates configurations that are not dominated by any other.

In [None]:
# common axis limits
all_accuracy = runs_df["accuracy"].dropna()
all_bias = runs_df["positional_bias"].dropna()
xlim_tradeoff = (max(0, all_accuracy.min() - 0.05), min(1, all_accuracy.max() + 0.05))
ylim_tradeoff = (max(0, all_bias.min() - 0.02), all_bias.max() + 0.02)

n_models = len(MODELS)
fig = plt.figure(figsize=(5 * n_models, 5))
gs = gridspec.GridSpec(1, n_models, wspace=0.3)

for idx, model_name in enumerate(MODELS):
    short_name = model_name.split("/")[-1]
    ax = fig.add_subplot(gs[0, idx])

    model_swept = few_shot_df[few_shot_df["model"] == short_name].copy()
    model_baseline = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "baseline")]
    model_dpo = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "dpo_lora")]

    plot_tradeoff(
        swept=model_swept,
        x_metric="accuracy",
        y_metric="positional_bias",
        sweep_col="k_positive",
        compare_to_pipelines=[
            ("baseline", model_baseline),
            ("DPO-LoRA", model_dpo),
        ],
        ax=ax,
        x_label="accuracy",
        y_label="positional bias",
        sweep_label="k_positive",
        title=short_name,
        show_pareto=True,
        maximize_x=True,
        maximize_y=False,
        xlim=xlim_tradeoff,
        ylim=ylim_tradeoff,
    )

fig.savefig(FIGURE_DIR / "tradeoff.png", bbox_inches="tight", dpi=150)
plt.show()

The number of positive examples has only a minor effect on the tradeoff between positional bias and accuracy. In the 0.5B model, any number of positive examples causes positional bias to jump (with a slight increase in accuracy). In the 1.5B model, a small number of positive examples increases positional bias along with accuracy, but the bias falls again as the accuracy gains disappear. Overall, the DPO-LoRA control achieves the best tradeoff for both models.
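The notion of a Pareto frontier used in the plot can be sketched as follows: a configuration is on the frontier if no other configuration is at least as good on both metrics and strictly better on one, where higher accuracy and lower positional bias are better. The points below are made-up values with neutral labels, not results from this notebook, and `pareto_frontier` is a hypothetical helper rather than the routine `plot_tradeoff` uses.

```python
def pareto_frontier(points):
    """Return labels of non-dominated points.

    Each point is (label, accuracy, bias); accuracy is maximized, bias minimized.
    """
    frontier = []
    for label, acc, bias in points:
        dominated = any(
            # The other point is no worse on both metrics ...
            (other_acc >= acc and other_bias <= bias)
            # ... and strictly better on at least one.
            and (other_acc > acc or other_bias < bias)
            for _, other_acc, other_bias in points
        )
        if not dominated:
            frontier.append(label)
    return frontier


# Made-up (label, accuracy, positional bias) values for illustration only.
points = [
    ("config_a", 0.40, 0.05),
    ("config_b", 0.45, 0.12),
    ("config_c", 0.38, 0.15),
    ("config_d", 0.60, 0.08),
]

print(pareto_frontier(points))
```

In this toy example `config_d` dominates `config_b` and `config_c`, while `config_a` survives because nothing matches its low bias; a method that lies alone on the frontier is better than every alternative on both axes.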

### Summary table

The table below summarizes all configurations, ordered by model and then by method, with mean accuracy highlighted by a color gradient.

In [None]:
method_order = ["baseline", "FewShot (k=1)", "FewShot (k=5)", "FewShot (k=10)", "FewShot (k=25)", "FewShot (k=50)", "DPO-LoRA"]

summary_table = summary_df.copy()
summary_table["method"] = summary_table.apply(
    lambda row: "baseline" if row["pipeline"] == "baseline"
    else "DPO-LoRA" if row["pipeline"] == "dpo_lora"
    else f"FewShot (k={int(row['k_positive'])})",
    axis=1
)

model_order = [m.split("/")[-1] for m in MODELS]
summary_table["model_order"] = summary_table["model"].apply(lambda m: model_order.index(m) if m in model_order else len(model_order))
summary_table["method_order"] = summary_table["method"].apply(lambda m: method_order.index(m) if m in method_order else len(method_order))

display_df = summary_table.sort_values(["model_order", "method_order"])[
    ["model", "method", "n_trials", "accuracy_mean", "accuracy_std", "positional_bias_mean", "positional_bias_std"]
].copy()
display_df.columns = ["model", "method", "trials", "accuracy (mean)", "accuracy (std)", "pos bias (mean)", "pos bias (std)"]

display_df.style.format({
    "accuracy (mean)": "{:.1%}",
    "accuracy (std)": "{:.1%}",
    "pos bias (mean)": "{:.3f}",
    "pos bias (std)": "{:.3f}",
}).background_gradient(subset=["accuracy (mean)"], cmap="RdYlGn")

## Takeaways

This notebook compared a DPO-trained LoRA adapter with few-shot prompting on a commonsense MCQA task. For the models studied (`Qwen/Qwen2.5-0.5B-Instruct` and `Qwen/Qwen2.5-1.5B-Instruct`), fine-tuning clearly outperforms FewShot in the smaller model, and by a smaller margin in the larger model. For both models, few-shot accuracy appears to peak at a moderate number of positive examples.