# Commonsense MCQA

This notebook benchmarks steering methods on the [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa) dataset. We compare three pipelines across multiple model sizes: the unsteered baseline, few-shot steering with a varying number of examples, and a LoRA adapter trained with DPO. We use `ControlSpec` to sweep over the number of few-shot examples, which lets us study how the few-shot control compares to the DPO-LoRA control as the number of examples grows.

## Setup

In [None]:
import os
import json

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import transformers
from datasets import Dataset
from peft import PeftType

from aisteer360.algorithms.input_control.few_shot.control import FewShot
from aisteer360.algorithms.core.specs import ControlSpec
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO
from aisteer360.evaluation.use_cases.commonsense_mcqa.use_case import CommonsenseMCQA
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_accuracy import MCQAAccuracy
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_positional_bias import MCQAPositionalBias
from aisteer360.evaluation.benchmark import Benchmark
from aisteer360.evaluation.utils import flatten_profiles, get_param_values, summarize_by_config

transformers.logging.set_verbosity_error()
os.chdir("./examples/notebooks/benchmark_commonsense_mcqa/")

MODELS = [
    # "Qwen/Qwen2.5-0.5B-Instruct",
    "Qwen/Qwen2.5-1.5B-Instruct",
    # "Qwen/Qwen2.5-3B-Instruct",
]

## Building the use case

The use case of interest has already been constructed via the [use case](../../../docs/tutorials/add_new_use_case.md) tutorial and is available at `aisteer360/evaluation/use_cases/commonsense_mcqa/use_case.py`.

In [2]:
commonsense_mcqa = CommonsenseMCQA(
    evaluation_data="evaluation_qa.jsonl",
    evaluation_metrics=[MCQAAccuracy(), MCQAPositionalBias()],
    num_shuffling_runs=20,
    num_samples=50
)

Two custom metrics have been created for the use case: `MCQAAccuracy`, which measures the accuracy statistics of each question (across trials), and `MCQAPositionalBias`, which measures positional bias (via deviation from the uniform distribution across runs). To support these statistics, the use case accepts a keyword argument `num_shuffling_runs` that sets how many times each question is presented to the (steered) model under a randomized ordering of the choices. We restrict the evaluation to `num_samples=50` datapoints for speed.
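
To make the two statistics concrete, here is a minimal sketch of how they could be computed for a single question. It assumes positional bias is the total variation distance between the empirical distribution of chosen answer positions and the uniform distribution. The actual `MCQAPositionalBias` implementation may use a different distance, and the helper names below are hypothetical.

```python
from collections import Counter


def accuracy_over_runs(is_correct_per_run):
    """Fraction of shuffled presentations of one question answered correctly."""
    return sum(is_correct_per_run) / len(is_correct_per_run)


def positional_bias(chosen_positions, num_choices):
    """Total variation distance from uniform (assumed definition, 0 = no bias)."""
    counts = Counter(chosen_positions)
    n = len(chosen_positions)
    uniform = 1.0 / num_choices
    return 0.5 * sum(
        abs(counts.get(pos, 0) / n - uniform) for pos in range(num_choices)
    )


# toy data: a model that always picks the first of five positions
always_first = [0] * 10
# toy data: a model whose picks are spread evenly over the five positions
evenly_spread = [0, 1, 2, 3, 4] * 2

print(positional_bias(always_first, num_choices=5))
print(positional_bias(evenly_spread, num_choices=5))
print(accuracy_over_runs([True, False, True, True]))
```

Shuffling the choices is what makes this measurable: if the correct answer moves between positions but the model keeps picking the same slot, the position counts pile up on one index.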

## Preparing the steering data

The benchmark uses steering data consisting of triples `(question, answer_chosen, answer_rejected)` extracted from the CommonsenseQA dataset.

In [3]:
with open("steer_qa.jsonl", "r") as f:
    steering_data = [json.loads(line) for line in f]

len(steering_data), steering_data[0]

(4871,
 {'id': '01beaf20-82aa-40b0-8b08-ee08b94e6666',
  'question': 'The spirit ascended to the after life, so what was it leaving?',
  'answer_chosen': 'human being',
  'answer_rejected': 'cemetary'})
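
The notebook does not show how `steer_qa.jsonl` was produced. As a rough sketch, a triple of this shape could be derived from a record in the CommonsenseQA schema (`question`, `choices` with parallel `label` and `text` lists, and `answerKey`) by pairing the correct choice with one incorrect choice. Picking the rejected answer at random is an assumption, the function name is hypothetical, and the record below is a made-up toy example rather than real dataset content.

```python
import random


def to_preference_triple(record, rng):
    """Pair the correct choice with one randomly chosen incorrect choice."""
    labels = record["choices"]["label"]
    texts = record["choices"]["text"]
    correct_idx = labels.index(record["answerKey"])
    wrong_texts = [t for i, t in enumerate(texts) if i != correct_idx]
    return {
        "id": record["id"],
        "question": record["question"],
        "answer_chosen": texts[correct_idx],
        "answer_rejected": rng.choice(wrong_texts),  # assumption: random pick
    }


# toy record in the CommonsenseQA layout (not taken from the dataset)
toy_record = {
    "id": "toy-0001",
    "question": "Where would you deposit a paycheck?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["bank", "forest", "kitchen", "stadium", "library"],
    },
    "answerKey": "A",
}

triple = to_preference_triple(toy_record, random.Random(0))
print(triple)
```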

For the `FewShot` control, we need to create example pools:

In [4]:
positive_pool = [{"question": row["question"], "answer": row["answer_chosen"]} for row in steering_data]
negative_pool = [{"question": row["question"], "answer": row["answer_rejected"]} for row in steering_data]

len(positive_pool), len(negative_pool)

(4871, 4871)

## Defining the controls

### FewShot with ControlSpec

Instead of using a fixed number of examples, we use `ControlSpec` to sweep over different values of `k_positive`. We fix `k_negative=0` to isolate the effect of positive examples.

In [5]:
few_shot_spec = ControlSpec(
    control_cls=FewShot,
    params={
        "selector_name": "random",
        "positive_example_pool": positive_pool,
        "negative_example_pool": negative_pool,
        "k_negative": 0,
    },
    vars=[{"k_positive": k} for k in [1, 5, 10, 25]],
    name="FewShot",
)
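
One way to think about the spec above: the shared `params` are merged with each entry of `vars`, giving one concrete configuration per entry. The sketch below illustrates that reading with plain dictionaries. It is not `ControlSpec`'s actual implementation, `expand_spec` is a hypothetical helper, and the example pools are omitted for brevity.

```python
def expand_spec(params, vars_list):
    """Merge shared params with each per-configuration override."""
    return [{**params, **overrides} for overrides in vars_list]


shared = {"selector_name": "random", "k_negative": 0}
configs = expand_spec(shared, [{"k_positive": k} for k in [1, 5, 10, 25]])

for i, cfg in enumerate(configs, start=1):
    print(f"configuration {i}: {cfg}")
```

Under this reading, the sweep yields four few-shot configurations that differ only in `k_positive`.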

### DPO with LoRA

The DPO-LoRA control serves as our target to beat. It uses the same steering data to fine-tune a LoRA adapter. The hyperparameters below are tuned for the low-data regime (~5k examples): higher learning rate for faster convergence, lower beta for softer preference signals, and smaller LoRA rank to reduce overfitting.

In [6]:
train_ds = Dataset.from_list([
    {"prompt": row["question"], "chosen": row["answer_chosen"], "rejected": row["answer_rejected"]}
    for row in steering_data
])

In [7]:
def create_dpo_control(model_name: str) -> DPO:
    """Create a DPO control with model-specific output directory."""
    short_name = model_name.split("/")[-1]
    return DPO(
        train_dataset=train_ds,

        # DPO / TRL config (tuned for low-data regime)
        output_dir=f"trl_models/{short_name}-DPO-Lora-Steer",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=5e-5,
        beta=0.05,
        loss_type="sigmoid",
        max_length=1024,
        max_prompt_length=512,
        disable_dropout=True,
        logging_steps=100,
        save_strategy="no",
        report_to="none",
        seed=123,

        # LoRA config (smaller rank to reduce overfitting)
        use_peft=True,
        peft_type=PeftType.LORA,
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        adapter_name="dpo",
        merge_lora_after_train=False,
    )
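
To give some intuition for `beta` and `loss_type="sigmoid"`, here is a small standalone sketch of the standard sigmoid DPO loss for a single preference pair, computed from sequence log-probabilities under the policy and the frozen reference model. The log-probability values are made up for illustration, and the function is a simplification that is not taken from TRL's code.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta):
    """Sigmoid DPO loss for one (chosen, rejected) pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# illustrative log-probabilities: the policy already prefers the chosen answer
pair = dict(policy_chosen_logp=-4.0, policy_rejected_logp=-6.0,
            ref_chosen_logp=-5.0, ref_rejected_logp=-5.0)

for beta in (0.05, 0.5):
    print(f"beta={beta}: loss={dpo_sigmoid_loss(**pair, beta=beta):.4f}")
```

For the same log-ratio margin, a smaller `beta` scales the argument of the sigmoid down, so the loss stays closer to its zero-margin value of `log(2)`.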

## Running the benchmark

The benchmark compares three pipelines across multiple model sizes:
- **baseline**: Unsteered model
- **few_shot_sweep**: FewShot with varying `k_positive` (1, 5, 10, 25)
- **dpo_lora**: DPO-trained LoRA adapter

We run with `num_trials=5` to capture statistical variability across generation runs.

In [None]:
all_profiles = {}

for model_name in MODELS:
    short_name = model_name.split("/")[-1]
    print(f"Running benchmark for {short_name}")

    dpo_lora = create_dpo_control(model_name)

    benchmark = Benchmark(
        use_case=commonsense_mcqa,
        base_model_name_or_path=model_name,
        steering_pipelines={
            "baseline": [],
            "few_shot_sweep": [few_shot_spec],
            "dpo_lora": [dpo_lora],
        },
        gen_kwargs={"max_new_tokens": 300, "do_sample": True, "temperature": 0.7},
        device_map="auto",
        num_trials=5
    )

    profiles = benchmark.run()
    all_profiles[short_name] = profiles

    # export per-model results
    benchmark.export(profiles, save_dir=f"./profiles/{short_name}/")

Running benchmark for Qwen2.5-3B-Instruct
Running pipeline: baseline...


Loading checkpoint shards: 100%|█████████████████████████████████████| 2/2 [00:09<00:00,  4.99s/it]


done.
Running pipeline: few_shot_sweep...
Running configuration 1...
Running configuration 2...


## Analysis

We now analyze the benchmark results across all model sizes. With multiple trials, we can compute mean and standard deviation to understand the statistical reliability of our comparisons.

### Flatten and summarize results

First, we flatten the nested profiles into a single DataFrame with one row per trial, then aggregate across trials to get mean and standard deviation.

In [None]:
# flatten profiles from all models into a single DataFrame
dfs = []
for model_name, profiles in all_profiles.items():
    df = flatten_profiles(
        profiles,
        metric_accessors={
            "accuracy": ("MCQAAccuracy", "question_mean"),
            "positional_bias": ("MCQAPositionalBias", "mean"),
        }
    )
    df["model"] = model_name
    df["k_positive"] = get_param_values(df, "FewShot", "k_positive")
    dfs.append(df)

runs_df = pd.concat(dfs, ignore_index=True)
runs_df[["model", "pipeline", "trial_id", "k_positive", "accuracy", "positional_bias"]].head(15)

In [None]:
# summarize by configuration (aggregate across trials)
summary_df = summarize_by_config(
    runs_df,
    metric_cols=["accuracy", "positional_bias"],
    group_cols=["model", "pipeline", "config_id"]
)

# add k_positive for few-shot rows
k_map = runs_df.groupby(["model", "pipeline", "config_id"])["k_positive"].first()
summary_df["k_positive"] = summary_df.apply(
    lambda row: k_map.get((row["model"], row["pipeline"], row["config_id"]), np.nan), axis=1
)

summary_df[["model", "pipeline", "k_positive", "n_trials", "accuracy_mean", "accuracy_std", "positional_bias_mean", "positional_bias_std"]].round(3)
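
For readers unfamiliar with the helper, the aggregation step is roughly equivalent to a pandas `groupby` over the configuration columns. The sketch below shows that idea on toy numbers that are made up for illustration. Whether `summarize_by_config` uses the sample standard deviation (pandas' default, `ddof=1`) is an assumption.

```python
import pandas as pd

# toy per-trial results: two trials for each of two pipelines
toy_runs = pd.DataFrame({
    "model": ["toy-model"] * 4,
    "pipeline": ["baseline", "baseline", "dpo_lora", "dpo_lora"],
    "config_id": [0, 0, 0, 0],
    "accuracy": [0.50, 0.60, 0.70, 0.80],
})

# one row per (model, pipeline, config_id) with mean, std, and trial count
toy_summary = (
    toy_runs.groupby(["model", "pipeline", "config_id"])["accuracy"]
    .agg(accuracy_mean="mean", accuracy_std="std", n_trials="count")
    .reset_index()
)

print(toy_summary)
```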

### Baseline accuracy by model size

We first examine how the unsteered baseline performance varies with model size.

In [None]:
baseline_df = summary_df[summary_df["pipeline"] == "baseline"].copy()
baseline_df["model_size"] = baseline_df["model"].map({
    "Qwen2.5-0.5B-Instruct": 0.5,
    "Qwen2.5-1.5B-Instruct": 1.5,
    "Qwen2.5-3B-Instruct": 3.0,
})
baseline_df = baseline_df.sort_values("model_size")

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(
    baseline_df["model"],
    baseline_df["accuracy_mean"],
    yerr=baseline_df["accuracy_std"],
    capsize=5,
    color="steelblue",
    edgecolor="black"
)
ax.set_xlabel("model")
ax.set_ylabel("accuracy")
ax.set_ylim(0, 1.0)
ax.set_title("baseline accuracy by model size")
ax.tick_params(axis="x", rotation=15)
ax.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()

### FewShot scaling by model

Next, we examine how FewShot accuracy scales with the number of positive examples across different model sizes.

In [None]:
few_shot_df = summary_df[summary_df["pipeline"] == "few_shot_sweep"].copy()
few_shot_df = few_shot_df.sort_values(["model", "k_positive"])

fig, ax = plt.subplots(figsize=(10, 6))
colors = {"Qwen2.5-0.5B-Instruct": "C0", "Qwen2.5-1.5B-Instruct": "C1", "Qwen2.5-3B-Instruct": "C2"}

for model_name in MODELS:
    short_name = model_name.split("/")[-1]
    model_df = few_shot_df[few_shot_df["model"] == short_name]
    ax.errorbar(
        model_df["k_positive"],
        model_df["accuracy_mean"],
        yerr=model_df["accuracy_std"],
        fmt="o-",
        capsize=4,
        capthick=1.5,
        markersize=8,
        label=short_name,
        color=colors[short_name]
    )

ax.set_xlabel("number of positive examples (k_positive)")
ax.set_ylabel("accuracy")
ax.set_title("FewShot accuracy scaling by model size")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Cross-method comparison per model

For each model, we compare the baseline, the best FewShot configuration (highest mean accuracy), and DPO-LoRA.

In [None]:
# find best few-shot k for each model
best_few_shot = few_shot_df.loc[few_shot_df.groupby("model")["accuracy_mean"].idxmax()].copy()
best_few_shot["method"] = best_few_shot["k_positive"].apply(lambda k: f"FewShot (k={int(k)})")

# get DPO results
dpo_df = summary_df[summary_df["pipeline"] == "dpo_lora"].copy()
dpo_df["method"] = "DPO-LoRA"

# get baseline
baseline_df_plot = baseline_df.copy()
baseline_df_plot["method"] = "baseline"

# combine for plotting
comparison_data = []
for model_name in [m.split("/")[-1] for m in MODELS]:
    baseline_row = baseline_df_plot[baseline_df_plot["model"] == model_name].iloc[0]
    dpo_row = dpo_df[dpo_df["model"] == model_name].iloc[0]
    fs_row = best_few_shot[best_few_shot["model"] == model_name].iloc[0]

    comparison_data.append({"model": model_name, "method": "baseline", "accuracy_mean": baseline_row["accuracy_mean"], "accuracy_std": baseline_row["accuracy_std"]})
    comparison_data.append({"model": model_name, "method": fs_row["method"], "accuracy_mean": fs_row["accuracy_mean"], "accuracy_std": fs_row["accuracy_std"]})
    comparison_data.append({"model": model_name, "method": "DPO-LoRA", "accuracy_mean": dpo_row["accuracy_mean"], "accuracy_std": dpo_row["accuracy_std"]})

comparison_df = pd.DataFrame(comparison_data)

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

models = [m.split("/")[-1] for m in MODELS]
x = np.arange(len(models))
width = 0.25

# group by method type
baseline_vals = comparison_df[comparison_df["method"] == "baseline"].set_index("model").loc[models]
dpo_vals = comparison_df[comparison_df["method"] == "DPO-LoRA"].set_index("model").loc[models]
fs_vals = comparison_df[comparison_df["method"].str.startswith("FewShot")].set_index("model").loc[models]

ax.bar(x - width, baseline_vals["accuracy_mean"], width, yerr=baseline_vals["accuracy_std"], capsize=4, label="baseline", color="gray", edgecolor="black")
ax.bar(x, fs_vals["accuracy_mean"], width, yerr=fs_vals["accuracy_std"], capsize=4, label="best FewShot", color="steelblue", edgecolor="black")
ax.bar(x + width, dpo_vals["accuracy_mean"], width, yerr=dpo_vals["accuracy_std"], capsize=4, label="DPO-LoRA", color="coral", edgecolor="black")

ax.set_xlabel("model")
ax.set_ylabel("accuracy")
ax.set_title("accuracy comparison across steering methods")
ax.set_xticks(x)
ax.set_xticklabels(models, rotation=15)
ax.legend()
ax.set_ylim(0, 1.0)
ax.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()

### Accuracy vs positional bias tradeoff

We examine whether there is a tradeoff between accuracy and positional bias across methods.

In [None]:
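model_names = [m.split("/")[-1] for m in MODELS]

# one panel per benchmarked model; squeeze=False keeps `axes` 2D even for a single model
fig, axes = plt.subplots(1, len(model_names), figsize=(5 * len(model_names), 5), squeeze=False)

for idx, model_name in enumerate(model_names):
    ax = axes[0, idx]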
    model_summary = summary_df[summary_df["model"] == model_name]

    # baseline
    bl = model_summary[model_summary["pipeline"] == "baseline"]
    ax.scatter(bl["accuracy_mean"], bl["positional_bias_mean"], marker="X", s=150, c="red", edgecolors="black", zorder=4, label="baseline")
    ax.errorbar(bl["accuracy_mean"], bl["positional_bias_mean"], xerr=bl["accuracy_std"], yerr=bl["positional_bias_std"], fmt="none", color="red", alpha=0.5, capsize=2)

    # DPO
    dpo = model_summary[model_summary["pipeline"] == "dpo_lora"]
    ax.scatter(dpo["accuracy_mean"], dpo["positional_bias_mean"], marker="s", s=120, c="coral", edgecolors="black", zorder=3, label="DPO-LoRA")
    ax.errorbar(dpo["accuracy_mean"], dpo["positional_bias_mean"], xerr=dpo["accuracy_std"], yerr=dpo["positional_bias_std"], fmt="none", color="coral", alpha=0.5, capsize=2)

    # FewShot (color by k)
    fs = model_summary[model_summary["pipeline"] == "few_shot_sweep"].sort_values("k_positive")
    scatter = ax.scatter(fs["accuracy_mean"], fs["positional_bias_mean"], c=fs["k_positive"], cmap="viridis", s=100, edgecolors="black", zorder=2, label="FewShot")
    for _, row in fs.iterrows():
        ax.errorbar(row["accuracy_mean"], row["positional_bias_mean"], xerr=row["accuracy_std"], yerr=row["positional_bias_std"], fmt="none", color="gray", alpha=0.4, capsize=2)
        ax.annotate(f"k={int(row['k_positive'])}", (row["accuracy_mean"], row["positional_bias_mean"]), textcoords="offset points", xytext=(5, 5), fontsize=8)

    ax.set_xlabel("accuracy")
    ax.set_ylabel("positional bias")
    ax.set_title(model_name)
    ax.legend(loc="upper right", fontsize=8)
    ax.grid(True, alpha=0.3)

plt.suptitle("accuracy vs positional bias tradeoff", y=1.02)
plt.tight_layout()
plt.show()

### Summary table

The final summary table ranks all configurations by accuracy.

In [None]:
summary_table = summary_df.copy()
summary_table["method"] = summary_table.apply(
    lambda row: "baseline" if row["pipeline"] == "baseline"
    else "DPO-LoRA" if row["pipeline"] == "dpo_lora"
    else f"FewShot (k={int(row['k_positive'])})",
    axis=1
)

display_df = summary_table[["model", "method", "n_trials", "accuracy_mean", "accuracy_std", "positional_bias_mean", "positional_bias_std"]].copy()
display_df.columns = ["model", "method", "trials", "accuracy (mean)", "accuracy (std)", "pos bias (mean)", "pos bias (std)"]
display_df = display_df.sort_values(["model", "accuracy (mean)"], ascending=[True, False])

display_df.style.format({
    "accuracy (mean)": "{:.1%}",
    "accuracy (std)": "{:.1%}",
    "pos bias (mean)": "{:.3f}",
    "pos bias (std)": "{:.3f}",
}).background_gradient(subset=["accuracy (mean)"], cmap="RdYlGn")

## Takeaways

