# Commonsense MCQA

Multiple-choice question answering is a common format for evaluating a model's reasoning ability. This notebook benchmarks steering methods on the [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa) dataset, comparing few-shot prompting against a LoRA adapter trained with DPO. We sweep over the number of few-shot examples and study how accuracy scales relative to the fine-tuned baseline across three Qwen2.5 instruct models (0.5B, 1.5B, and 3B).

### Runtime Estimate

> **Estimated Time:** 1 hour (approx. 20 minutes for each of the three models)  
> **Device:** NVIDIA H100 GPU (80GB VRAM)

Times are approximate and vary based on dataset size, number of sweeps, and model configuration. Adjust parameters in the cells below to modify runtime.

## Setup

In [None]:
import json
from pathlib import Path

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import pandas as pd
import transformers
from datasets import Dataset
from peft import PeftType

from aisteer360.algorithms.input_control.few_shot.control import FewShot
from aisteer360.algorithms.core.specs import ControlSpec
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO
from aisteer360.evaluation.use_cases.commonsense_mcqa.use_case import CommonsenseMCQA
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_accuracy import MCQAAccuracy
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_positional_bias import MCQAPositionalBias
from aisteer360.evaluation.benchmark import Benchmark
from aisteer360.evaluation.utils.data_utils import flatten_profiles, get_param_values, summarize_by_config
from aisteer360.evaluation.utils.viz_utils import plot_sensitivity, plot_tradeoff

transformers.logging.set_verbosity_error()

MODELS = [
    "Qwen/Qwen2.5-0.5B-Instruct",
    "Qwen/Qwen2.5-1.5B-Instruct",
    "Qwen/Qwen2.5-3B-Instruct",
]

NOTEBOOK_DIR = Path(__file__).parent if "__file__" in dir() else Path.cwd() / "examples/notebooks/benchmark_commonsense_mcqa"
FIGURE_DIR = NOTEBOOK_DIR / "figures"
FIGURE_DIR.mkdir(parents=True, exist_ok=True)  # also create missing parent directories


## Building the use case

The use case of interest has already been constructed via the [use case](../../../docs/tutorials/add_new_use_case.md) tutorial and is available at `aisteer360/evaluation/use_cases/commonsense_mcqa/use_case.py`.

In [2]:
commonsense_mcqa = CommonsenseMCQA(
    evaluation_data=NOTEBOOK_DIR / "data/evaluation_qa.jsonl",
    evaluation_metrics=[MCQAAccuracy(), MCQAPositionalBias()],
    num_shuffling_runs=20,
    num_samples=50
)

Two custom metrics have been created for the use case: `MCQAAccuracy`, which measures the accuracy statistics of each question (across trials), and `MCQAPositionalBias`, which measures the positional bias (via deviation from the uniform distribution across runs). To facilitate computation of these statistics, the use case accepts a keyword argument `num_shuffling_runs` that sets how many times each question is presented to the (steered) model under a randomized ordering of the choices. We restrict the number of evaluation datapoints to `num_samples=50` for speed.
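
To make the two statistics concrete, here is a minimal sketch of how a per-question accuracy and a deviation-from-uniform positional bias could be computed from shuffled runs. The function names and the exact bias formula (mean absolute deviation from the uniform distribution) are assumptions for illustration; the actual `MCQAAccuracy` and `MCQAPositionalBias` implementations may differ.

```python
from collections import Counter


def question_accuracy(correct_flags):
    """Fraction of shuffled runs in which the model picked the right answer."""
    return sum(correct_flags) / len(correct_flags)


def positional_bias(chosen_positions, num_choices):
    """Mean absolute deviation of the chosen-position frequencies from uniform.

    Hypothetical definition: 0.0 means every position is picked equally often.
    """
    counts = Counter(chosen_positions)
    total = len(chosen_positions)
    uniform = 1.0 / num_choices
    deviations = [abs(counts.get(pos, 0) / total - uniform) for pos in range(num_choices)]
    return sum(deviations) / num_choices


# toy data: 20 shuffled runs of one five-choice question
balanced = [0, 1, 2, 3, 4] * 4   # every position chosen 4 times
always_first = [0] * 20          # model always picks the first option

print(positional_bias(balanced, 5))
print(positional_bias(always_first, 5))
print(question_accuracy([True, True, False, True]))
```

Shuffling the choices before each run is what lets the two effects be separated: a model that always answers with the first option scores a high bias here regardless of whether that option happens to be correct.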

## Preparing the steering data

The benchmark uses steering data consisting of triples `(question, answer_chosen, answer_rejected)` extracted from the CommonsenseQA dataset.

In [3]:
with open(NOTEBOOK_DIR / "data/steer_qa.jsonl", "r") as f:
    steering_data = [json.loads(line) for line in f]

len(steering_data), steering_data[0]

(4871,
 {'id': '01beaf20-82aa-40b0-8b08-ee08b94e6666',
  'question': 'The spirit ascended to the after life, so what was it leaving?',
  'answer_chosen': 'human being',
  'answer_rejected': 'cemetary'})
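
For orientation, the sketch below shows one way such a triple could be derived from a record in the CommonsenseQA layout (`question`, `choices` with parallel `label` and `text` lists, and `answerKey`). The record is an invented toy example, `to_triple` is a hypothetical helper, and picking the rejected answer uniformly at random from the wrong choices is an assumption; the notebook does not state how `steer_qa.jsonl` was built.

```python
import random


def to_triple(record, rng):
    """Turn one CommonsenseQA-style record into a (question, chosen, rejected) triple."""
    labels = record["choices"]["label"]
    texts = record["choices"]["text"]
    correct_idx = labels.index(record["answerKey"])
    wrong_texts = [text for i, text in enumerate(texts) if i != correct_idx]
    return {
        "id": record["id"],
        "question": record["question"],
        "answer_chosen": texts[correct_idx],
        # assumption: the rejected answer is a random incorrect choice
        "answer_rejected": rng.choice(wrong_texts),
    }


# invented toy record (not taken from the dataset)
toy_record = {
    "id": "toy-0001",
    "question": "Where would a cat most likely nap on a cold day?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["near a heater", "in a freezer", "underwater", "on the moon", "in a volcano"],
    },
    "answerKey": "A",
}

triple = to_triple(toy_record, random.Random(0))
print(triple["answer_chosen"], "|", triple["answer_rejected"])
```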

For the `FewShot` control, we need to create example pools:

In [4]:
positive_pool = [{"question": row["question"], "answer": row["answer_chosen"]} for row in steering_data]
negative_pool = [{"question": row["question"], "answer": row["answer_rejected"]} for row in steering_data]

len(positive_pool), len(negative_pool)

(4871, 4871)
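
As a rough picture of what a random selector does with these pools, the sketch below samples `k` examples and renders them as a prompt prefix. The helper names and the prompt template are hypothetical; the `FewShot` control may select and format examples differently.

```python
import random


def sample_examples(pool, k, seed=0):
    """Draw k distinct examples from the pool (seeded for reproducibility)."""
    return random.Random(seed).sample(pool, k)


def render_prefix(examples):
    """Format the sampled examples as a block of solved questions (hypothetical template)."""
    blocks = ["Here are some solved examples:"]
    for example in examples:
        blocks.append(f"Question: {example['question']}\nAnswer: {example['answer']}")
    return "\n\n".join(blocks)


# toy pool with the same keys as positive_pool
toy_pool = [{"question": f"toy question {i}", "answer": f"toy answer {i}"} for i in range(10)]

print(render_prefix(sample_examples(toy_pool, k=2)))
```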

## Defining the controls

### FewShot with ControlSpec

Instead of using a fixed number of examples, we use `ControlSpec` to sweep over different values of `k_positive`. We fix `k_negative=0` to isolate the effect of positive examples.

In [5]:
few_shot_spec = ControlSpec(
    control_cls=FewShot,
    params={
        "selector_name": "random",
        "positive_example_pool": positive_pool,
        "negative_example_pool": negative_pool,
        "k_negative": 0,
    },
    vars=[{"k_positive": k} for k in [1, 5, 10, 25]],
    name="FewShot",
)
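
Conceptually, a sweep of this kind combines the fixed `params` with each entry of `vars` to produce one configuration per entry. The sketch below illustrates that idea with plain dictionaries; `expand_sweep` is a hypothetical helper and not how `ControlSpec` is necessarily implemented.

```python
def expand_sweep(fixed_params, sweep_vars):
    """Merge the fixed parameters with each swept assignment, one config per entry."""
    return [{**fixed_params, **assignment} for assignment in sweep_vars]


fixed = {"selector_name": "random", "k_negative": 0}
swept = [{"k_positive": k} for k in [1, 5, 10, 25]]

for i, config in enumerate(expand_sweep(fixed, swept), start=1):
    print(f"configuration {i}: {config}")
```

Four swept values give four configurations, which matches the four `Running configuration ...` lines printed for the `few_shot_sweep` pipeline when the benchmark runs.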

### DPO with LoRA

The DPO-LoRA control serves as our target to beat. It uses the same steering data to fine-tune a LoRA adapter. 

In [6]:
train_ds = Dataset.from_list([
    {"prompt": row["question"], "chosen": row["answer_chosen"], "rejected": row["answer_rejected"]}
    for row in steering_data
])

Since this notebook compares models of different sizes, it is generally advisable to use model-specific training parameters. Here, the smallest model uses a higher learning rate and more epochs for faster convergence, while the larger models use more conservative settings (lower learning rate, higher beta, larger LoRA rank) to avoid overfitting on the relatively small training set (~5k examples). We encode these differences in `DPO_CONFIGS` and create a simple wrapper function (`create_dpo_control`) to instantiate the respective controls.

In [None]:
# model-specific DPO hyperparameters
DPO_CONFIGS = {
    "Qwen/Qwen2.5-0.5B-Instruct": {
        "learning_rate": 5e-5,
        "beta": 0.05,
        "num_train_epochs": 3,
        "r": 8,
        "lora_alpha": 16,
    },
    "Qwen/Qwen2.5-1.5B-Instruct": {
        "learning_rate": 2e-5,
        "beta": 0.1,
        "num_train_epochs": 2,
        "r": 16,
        "lora_alpha": 32,
    },
    "Qwen/Qwen2.5-3B-Instruct": {
        "learning_rate": 1e-5,
        "beta": 0.1,
        "num_train_epochs": 2,
        "r": 16,
        "lora_alpha": 32,
    },
}


def create_dpo_control(model_name: str) -> DPO:
    """Create a DPO control with model-specific hyperparameters."""
    short_name = model_name.split("/")[-1]
    config = DPO_CONFIGS.get(model_name, DPO_CONFIGS["Qwen/Qwen2.5-0.5B-Instruct"])

    return DPO(
        train_dataset=train_ds,

        # DPO / TRL config (model-specific)
        output_dir=NOTEBOOK_DIR / f"trl_models/{short_name}-DPO-Lora-Steer",
        per_device_train_batch_size=4,
        num_train_epochs=config["num_train_epochs"],
        learning_rate=config["learning_rate"],
        beta=config["beta"],
        loss_type="sigmoid",
        max_length=1024,
        max_prompt_length=512,
        disable_dropout=True,
        logging_steps=100,
        save_strategy="no",
        report_to="none",
        seed=123,

        # LoRA config (model-specific)
        use_peft=True,
        peft_type=PeftType.LORA,
        r=config["r"],
        lora_alpha=config["lora_alpha"],
        target_modules=["q_proj", "v_proj"],
        adapter_name="dpo",
        merge_lora_after_train=False,
    )
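
To see what `beta` and `loss_type="sigmoid"` control, here is a minimal sketch of the standard DPO sigmoid loss for a single preference pair, written in plain Python. It is an illustration of the formula under the usual definition, not the TRL implementation; the function name and the example log-probabilities are made up.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))) for one pair."""
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# policy identical to the reference model: margin is 0, so the loss is log(2)
print(dpo_sigmoid_loss(-40.0, -42.0, -40.0, -42.0, beta=0.1))

# policy favours the chosen answer more than the reference does: loss drops
print(dpo_sigmoid_loss(-36.0, -44.0, -40.0, -42.0, beta=0.1))
```

At the start of training the adapter has not yet moved the policy away from the reference, so the loss starts near log(2) ≈ 0.693. A larger `beta` scales the margin, so the same change in log-probabilities moves the loss more.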

## Running the benchmark

The benchmark compares three steering approaches across multiple model sizes:
- **baseline**: Unsteered model
- **few_shot_sweep**: FewShot with varying `k_positive` (1, 5, 10, 25)
- **dpo_lora**: DPO-trained LoRA adapter

We run with `num_trials=5` to capture statistical variability across generation runs (at the cost of slower execution).

In [8]:
all_profiles = {}

for model_name in MODELS:
    short_name = model_name.split("/")[-1]
    print(f"Running benchmark for {short_name}")

    dpo_lora = create_dpo_control(model_name)

    benchmark = Benchmark(
        use_case=commonsense_mcqa,
        base_model_name_or_path=model_name,
        steering_pipelines={
            "baseline": [],
            "few_shot_sweep": [few_shot_spec],
            "dpo_lora": [dpo_lora],
        },
        gen_kwargs={"max_new_tokens": 300, "do_sample": True, "temperature": 0.7},
        device_map="auto",
        num_trials=5
    )

    profiles = benchmark.run()
    all_profiles[short_name] = profiles

    # export per-model results
    benchmark.export(profiles, save_dir=f"./profiles/{short_name}/")

Running benchmark for Qwen2.5-0.5B-Instruct
Running pipeline: baseline...
done.
Running pipeline: few_shot_sweep...
Running configuration 1...
Running configuration 2...
Running configuration 3...
Running configuration 4...
done.
Running pipeline: dpo_lora...


Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 50578.07 examples/s]
Extracting prompt in train dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 33117.88 examples/s]
Applying chat template to train dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 39726.63 examples/s]
Tokenizing train dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 6486.53 examples/s]
Train dataset reference log probs: 100%|████████████████████████████████████████

{'loss': 0.6888, 'grad_norm': 1.1701231002807617, 'learning_rate': 4.8645320197044334e-05, 'rewards/chosen': 0.03167436644434929, 'rewards/rejected': 0.02267163060605526, 'rewards/accuracies': 0.6025000214576721, 'rewards/margins': 0.009002731181681156, 'logps/chosen': -40.01519775390625, 'logps/rejected': -42.47361373901367, 'logits/chosen': -1.2978674173355103, 'logits/rejected': -1.2629873752593994, 'epoch': 0.08210180623973727}
{'loss': 0.6657, 'grad_norm': 1.5464173555374146, 'learning_rate': 4.7276956759715385e-05, 'rewards/chosen': 0.19636569917201996, 'rewards/rejected': 0.13458965718746185, 'rewards/accuracies': 0.6875, 'rewards/margins': 0.061776045709848404, 'logps/chosen': -36.535667419433594, 'logps/rejected': -39.877471923828125, 'logits/chosen': -1.5315977334976196, 'logits/rejected': -1.519681692123413, 'epoch': 0.16420361247947454}
{'loss': 0.6149, 'grad_norm': 4.150752067565918, 'learning_rate': 4.590859332238643e-05, 'rewards/chosen': 0.41691139340400696, 'rewards/re

Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 52394.78 examples/s]
Extracting prompt in train dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 33447.53 examples/s]
Applying chat template to train dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 39230.94 examples/s]
Tokenizing train dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 6544.56 examples/s]
Train dataset reference log probs: 100%|████████████████████████████████████████

{'loss': 0.6799, 'grad_norm': 0.747081458568573, 'learning_rate': 4.8645320197044334e-05, 'rewards/chosen': 0.0847334936261177, 'rewards/rejected': 0.05650382116436958, 'rewards/accuracies': 0.6924999952316284, 'rewards/margins': 0.028229672461748123, 'logps/chosen': -41.07448196411133, 'logps/rejected': -43.822601318359375, 'logits/chosen': 0.8180859088897705, 'logits/rejected': 0.8961646556854248, 'epoch': 0.08210180623973727}
{'loss': 0.6152, 'grad_norm': 1.4636225700378418, 'learning_rate': 4.7276956759715385e-05, 'rewards/chosen': 0.31208381056785583, 'rewards/rejected': 0.12476341426372528, 'rewards/accuracies': 0.7699999809265137, 'rewards/margins': 0.18732041120529175, 'logps/chosen': -36.116390228271484, 'logps/rejected': -43.04884719848633, 'logits/chosen': 0.10593454539775848, 'logits/rejected': 0.21649768948554993, 'epoch': 0.16420361247947454}
{'loss': 0.5299, 'grad_norm': 3.6608152389526367, 'learning_rate': 4.590859332238643e-05, 'rewards/chosen': 0.4018767476081848, 're

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.29s/it]


done.
Running pipeline: few_shot_sweep...
Running configuration 1...
Running configuration 2...
Running configuration 3...
Running configuration 4...
done.
Running pipeline: dpo_lora...


Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.39s/it]
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 52075.60 examples/s]
Extracting prompt in train dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 33352.63 examples/s]
Applying chat template to train dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4871/4871 [00:00<00:00, 38893.30 examples/s]
Tokenizing train dataset: 100%|█████████████████████████████████████████████████

{'loss': 0.6791, 'grad_norm': 0.911645770072937, 'learning_rate': 4.8645320197044334e-05, 'rewards/chosen': 0.06905547529459, 'rewards/rejected': 0.03922340273857117, 'rewards/accuracies': 0.7024999856948853, 'rewards/margins': 0.029832076281309128, 'logps/chosen': -45.35773849487305, 'logps/rejected': -49.17695236206055, 'logits/chosen': -0.46724268794059753, 'logits/rejected': -0.4864496886730194, 'epoch': 0.08210180623973727}
{'loss': 0.5751, 'grad_norm': 1.7376508712768555, 'learning_rate': 4.7276956759715385e-05, 'rewards/chosen': 0.27644240856170654, 'rewards/rejected': -0.013218633830547333, 'rewards/accuracies': 0.7975000143051147, 'rewards/margins': 0.2896610200405121, 'logps/chosen': -41.33415985107422, 'logps/rejected': -50.232948303222656, 'logits/chosen': -2.039604425430298, 'logits/rejected': -2.055771827697754, 'epoch': 0.16420361247947454}
{'loss': 0.4719, 'grad_norm': 3.441840171813965, 'learning_rate': 4.590859332238643e-05, 'rewards/chosen': 0.39632648229599, 'reward
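
The `epoch` value in the first logged training step can be sanity-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is stated in the notebook), so that one optimizer step consumes `per_device_train_batch_size=4` examples.

```python
import math

num_examples = 4871         # size of the steering dataset
per_device_batch_size = 4   # from create_dpo_control
logging_steps = 100         # from create_dpo_control

# the last batch of an epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
epoch_at_first_log = logging_steps / steps_per_epoch

print(steps_per_epoch, round(epoch_at_first_log, 4))  # 1218 0.0821
```

Under these assumptions the first log line falls at epoch ≈ 0.0821, in line with the `'epoch': 0.08210180623973727` entries above.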

## Analysis

We now analyze the benchmark results across all model sizes. With multiple trials, we can compute mean and standard deviation to understand the statistical reliability of our comparisons.

First, we flatten the nested profiles into a single DataFrame with one row per trial, then aggregate across trials to get mean and standard deviation.

In [9]:
dfs = []
for model_name, profiles in all_profiles.items():
    df = flatten_profiles(
        profiles,
        metric_accessors={
            "accuracy": ("MCQAAccuracy", "question_mean"),
            "positional_bias": ("MCQAPositionalBias", "mean"),
        }
    )
    df["model"] = model_name
    df["k_positive"] = get_param_values(df, "FewShot", "k_positive")
    dfs.append(df)

runs_df = pd.concat(dfs, ignore_index=True)
runs_df[["model", "pipeline", "trial_id", "k_positive", "accuracy", "positional_bias"]].head(15)

|   | model | pipeline | trial_id | k_positive | accuracy | positional_bias |
|---|---|---|---|---|---|---|
| 0 | Qwen2.5-0.5B-Instruct | baseline | 0 | NaN | 0.32 | 0.0652 |
| 1 | Qwen2.5-0.5B-Instruct | baseline | 1 | NaN | 0.40 | 0.0744 |
| 2 | Qwen2.5-0.5B-Instruct | baseline | 2 | NaN | 0.32 | 0.0692 |
| 3 | Qwen2.5-0.5B-Instruct | baseline | 3 | NaN | 0.38 | 0.0788 |
| 4 | Qwen2.5-0.5B-Instruct | baseline | 4 | NaN | 0.32 | 0.0736 |
| 5 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 0 | 1.0 | 0.44 | 0.1168 |
| 6 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 1 | 1.0 | 0.40 | 0.1044 |
| 7 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 2 | 1.0 | 0.44 | 0.1008 |
| 8 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 3 | 1.0 | 0.44 | 0.1092 |
| 9 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 4 | 1.0 | 0.38 | 0.1096 |


In [10]:
# summarize by configuration (aggregate across trials)
summary_df = summarize_by_config(
    runs_df,
    metric_cols=["accuracy", "positional_bias"],
    group_cols=["model", "pipeline", "config_id"]
)

# add k_positive for few-shot rows
k_map = runs_df.groupby(["model", "pipeline", "config_id"])["k_positive"].first()
summary_df["k_positive"] = summary_df.apply(
    lambda row: k_map.get((row["model"], row["pipeline"], row["config_id"]), np.nan), axis=1
)

summary_df[["model", "pipeline", "k_positive", "n_trials", "accuracy_mean", "accuracy_std", "positional_bias_mean", "positional_bias_std"]].round(3)

|   | model | pipeline | k_positive | n_trials | accuracy_mean | accuracy_std | positional_bias_mean | positional_bias_std |
|---|---|---|---|---|---|---|---|---|
| 0 | Qwen2.5-0.5B-Instruct | baseline | NaN | 5.0 | 0.348 | 0.039 | 0.072 | 0.005 |
| 1 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 1.0 | 5.0 | 0.420 | 0.028 | 0.108 | 0.006 |
| 2 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 5.0 | 5.0 | 0.416 | 0.043 | 0.107 | 0.002 |
| 3 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 10.0 | 5.0 | 0.444 | 0.026 | 0.103 | 0.003 |
| 4 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 25.0 | 5.0 | 0.420 | 0.049 | 0.102 | 0.003 |
| 5 | Qwen2.5-0.5B-Instruct | dpo_lora | NaN | 5.0 | 0.416 | 0.026 | 0.116 | 0.003 |
| 6 | Qwen2.5-1.5B-Instruct | baseline | NaN | 5.0 | 0.760 | 0.014 | 0.012 | 0.004 |
| 7 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 1.0 | 5.0 | 0.772 | 0.011 | 0.028 | 0.006 |
| 8 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 5.0 | 5.0 | 0.760 | 0.014 | 0.030 | 0.006 |
| 9 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 10.0 | 5.0 | 0.772 | 0.011 | 0.023 | 0.004 |
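
The aggregation can be checked by hand on the per-trial accuracies printed earlier for Qwen2.5-0.5B-Instruct. The sketch below uses only the standard library; it assumes the reported standard deviation is the sample standard deviation (divisor `n - 1`), which is what the printed values are consistent with. Whether `summarize_by_config` computes it exactly this way is not shown in the notebook.

```python
import statistics

# per-trial accuracies copied from the per-trial output above
per_trial_accuracy = {
    "baseline": [0.32, 0.40, 0.32, 0.38, 0.32],
    "few_shot_sweep (k_positive=1)": [0.44, 0.40, 0.44, 0.44, 0.38],
}

for name, values in per_trial_accuracy.items():
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    print(f"{name}: mean={mean:.3f}, std={std:.3f}")
```

These reproduce the `accuracy_mean` and `accuracy_std` entries of rows 0 and 1 in the summary (0.348 ± 0.039 and 0.420 ± 0.028).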


### FewShot scaling

We examine how FewShot accuracy scales with the number of positive examples. The baseline (unsteered) and DPO-LoRA results are shown as horizontal reference lines for comparison.

In [None]:
few_shot_df = summary_df[summary_df["pipeline"] == "few_shot_sweep"].copy()
few_shot_df = few_shot_df.sort_values(["model", "k_positive"])

# compute common axis limits for consistency across models
all_accuracy = runs_df["accuracy"].dropna()
ylim_accuracy = (max(0, all_accuracy.min() - 0.1), min(1, all_accuracy.max() + 0.1))

n_models = len(MODELS)
fig = plt.figure(figsize=(5 * n_models, 4))
gs = gridspec.GridSpec(1, n_models, wspace=0.3)

for idx, model_name in enumerate(MODELS):
    short_name = model_name.split("/")[-1]
    ax = fig.add_subplot(gs[0, idx])

    # data for this model
    model_swept = few_shot_df[few_shot_df["model"] == short_name].copy()
    model_baseline = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "baseline")]
    model_trials = runs_df[(runs_df["model"] == short_name) & (runs_df["pipeline"] == "few_shot_sweep")]

    plot_sensitivity(
        swept=model_swept,
        metric="accuracy",
        sweep_col="k_positive",
        baseline=model_baseline,
        per_trial_data=model_trials,
        ax=ax,
        metric_label="accuracy",
        sweep_label="k_positive",
        title=short_name,
        ylim=ylim_accuracy,
        save_path=FIGURE_DIR / f"sensitivity_accuracy_{short_name}.png" if idx == 0 else None,
    )

    # add DPO-LoRA reference line
    dpo_row = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "dpo_lora")]
    if not dpo_row.empty:
        dpo_acc = dpo_row["accuracy_mean"].iloc[0]
        dpo_std = dpo_row["accuracy_std"].iloc[0]
        ax.axhline(dpo_acc, color="#E24A33", linestyle=":", linewidth=1.5, label="DPO-LoRA")
        ax.axhspan(dpo_acc - dpo_std, dpo_acc + dpo_std, color="#E24A33", alpha=0.1)
        ax.legend(frameon=False, loc="lower right", fontsize=8)

plt.show()

### Accuracy vs positional bias tradeoff

We examine whether there is a tradeoff between accuracy and positional bias across methods. The FewShot configurations are colored by `k_positive`, with the baseline shown as a black X marker and DPO-LoRA as a red square. The Pareto frontier indicates configurations that are not dominated by any other.

In [None]:
# compute common axis limits for consistency across models
all_accuracy = runs_df["accuracy"].dropna()
all_bias = runs_df["positional_bias"].dropna()
xlim_tradeoff = (max(0, all_accuracy.min() - 0.05), min(1, all_accuracy.max() + 0.05))
ylim_tradeoff = (max(0, all_bias.min() - 0.02), all_bias.max() + 0.02)

n_models = len(MODELS)
fig = plt.figure(figsize=(5 * n_models, 5))
gs = gridspec.GridSpec(1, n_models, wspace=0.3)

for idx, model_name in enumerate(MODELS):
    short_name = model_name.split("/")[-1]
    ax = fig.add_subplot(gs[0, idx])

    # data for this model
    model_swept = few_shot_df[few_shot_df["model"] == short_name].copy()
    model_baseline = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "baseline")]
    model_trials = runs_df[(runs_df["model"] == short_name) & (runs_df["pipeline"] == "few_shot_sweep")]

    plot_tradeoff(
        swept=model_swept,
        x_metric="accuracy",
        y_metric="positional_bias",
        sweep_col="k_positive",
        baseline=model_baseline,
        per_trial_data=model_trials,
        ax=ax,
        x_label="accuracy",
        y_label="positional bias",
        sweep_label="k_positive",
        title=short_name,
        show_pareto=True,
        maximize_x=True,
        maximize_y=False,  # lower positional bias is better
        xlim=xlim_tradeoff,
        ylim=ylim_tradeoff,
        save_path=FIGURE_DIR / f"tradeoff_{short_name}.png" if idx == 0 else None,
    )

    # add DPO-LoRA marker
    dpo_row = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "dpo_lora")]
    if not dpo_row.empty:
        dpo_x = dpo_row["accuracy_mean"].iloc[0]
        dpo_y = dpo_row["positional_bias_mean"].iloc[0]
        dpo_x_err = dpo_row["accuracy_std"].iloc[0]
        dpo_y_err = dpo_row["positional_bias_std"].iloc[0]
        ax.errorbar(dpo_x, dpo_y, xerr=dpo_x_err, yerr=dpo_y_err, fmt="none", ecolor="#E24A33", elinewidth=0.5, capsize=2, capthick=0.5, zorder=6)
        ax.scatter(dpo_x, dpo_y, marker="s", s=100, c="#E24A33", edgecolors="white", linewidth=1, zorder=7, label="DPO-LoRA")
        ax.legend(frameon=False, loc="upper right", fontsize=8)

plt.show()
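
For readers unfamiliar with Pareto dominance, the sketch below computes the non-dominated set by hand from the Qwen2.5-0.5B-Instruct means in the summary output, maximizing accuracy and minimizing positional bias. It is a standalone illustration: `dominates` and `pareto_front` are hypothetical helpers, and `plot_tradeoff` may compute its frontier over a different set of points (for example, only the swept configurations).

```python
def dominates(a, b):
    """True if a is at least as good as b on both metrics and strictly better on one.

    Points are (accuracy, positional_bias): higher accuracy and lower bias are better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(points):
    """Names of all points that no other point dominates."""
    return [
        name for name, point in points.items()
        if not any(dominates(other, point) for other_name, other in points.items() if other_name != name)
    ]


# (accuracy_mean, positional_bias_mean) from the summary output above
points_05b = {
    "baseline": (0.348, 0.072),
    "FewShot (k=1)": (0.420, 0.108),
    "FewShot (k=5)": (0.416, 0.107),
    "FewShot (k=10)": (0.444, 0.103),
    "FewShot (k=25)": (0.420, 0.102),
    "DPO-LoRA": (0.416, 0.116),
}

print(pareto_front(points_05b))
```

With these means, the baseline (lowest bias), `k=10` (highest accuracy), and `k=25` are non-dominated; the remaining configurations are each beaten on both axes by at least one of them.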

### Summary table

The table below summarizes all configurations, ordered by model and then by method (in benchmark order), with the mean-accuracy column shaded by value to make performance differences across methods and models easy to spot.

In [None]:
# define method ordering (benchmark order)
method_order = ["baseline", "FewShot (k=1)", "FewShot (k=5)", "FewShot (k=10)", "FewShot (k=25)", "DPO-LoRA"]

summary_table = summary_df.copy()
summary_table["method"] = summary_table.apply(
    lambda row: "baseline" if row["pipeline"] == "baseline"
    else "DPO-LoRA" if row["pipeline"] == "dpo_lora"
    else f"FewShot (k={int(row['k_positive'])})",
    axis=1
)

# create ordering keys for proper sorting
model_order = [m.split("/")[-1] for m in MODELS]
summary_table["model_order"] = summary_table["model"].apply(lambda m: model_order.index(m) if m in model_order else len(model_order))
summary_table["method_order"] = summary_table["method"].apply(lambda m: method_order.index(m) if m in method_order else len(method_order))

display_df = summary_table.sort_values(["model_order", "method_order"])[
    ["model", "method", "n_trials", "accuracy_mean", "accuracy_std", "positional_bias_mean", "positional_bias_std"]
].copy()
display_df.columns = ["model", "method", "trials", "accuracy (mean)", "accuracy (std)", "pos bias (mean)", "pos bias (std)"]

display_df.style.format({
    "accuracy (mean)": "{:.1%}",
    "accuracy (std)": "{:.1%}",
    "pos bias (mean)": "{:.3f}",
    "pos bias (std)": "{:.3f}",
}).background_gradient(subset=["accuracy (mean)"], cmap="RdYlGn")

## Takeaways

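Few-shot prompting provides a lightweight alternative to fine-tuning for improving accuracy on commonsense MCQA. In the results shown above, even a single positive example lifts Qwen2.5-0.5B-Instruct from 34.8% mean accuracy (baseline) to 42.0%, and the sweep peaks at 44.4% for `k_positive=10`; the DPO-trained LoRA adapter reaches a comparable 41.6% but requires more upfront compute for training. The effect is much smaller on Qwen2.5-1.5B-Instruct, where the few-shot rows shown (76.0% to 77.2%) stay within about one standard deviation of the 76.0% baseline. Both steering methods also raise positional bias relative to the baseline on the 0.5B model (roughly 0.10 to 0.12 versus 0.072), so accuracy gains should be read alongside that tradeoff.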